12/11/23, 10:40 AM
https://www.exponentialview.co/p/chartpack-measuring-ai-1
Chartpack: Measuring AI (1/3)
The mismeasure of AI: How it began
Azeem Azhar
Hi,
Azeem here. We’re introducing Chartpacks, a new format for
investigating the questions we care about through quant and qual
assessment. Each Chartpack will explore a particular exponential thesis
over three to four weeks.
The first part of each Chartpack will be available to all recipients of the
newsletter. The subsequent parts will be sent to the paying members of
Exponential View.
We’re aiming to produce 13-15 of these a year.
In the first Chartpack, EV team member Nathan Warren explores how
the way we evaluate AI systems has changed and the challenges
posed to it by large language models like ChatGPT.
You can find part 2 and part 3 here.
Part 1 | The mismeasure of AI: How it began
What if the way we evaluate artificial intelligence were flawed?1 The rapid
rise of ChatGPT and other large language models (LLMs) has left us
struggling to understand where we stand in the AI landscape. Old
standards, like the problematic Turing Test2, are no longer relevant, with
GPT-4's output already being (mostly) indistinguishable from
human-made text. However, this doesn't mean that it has reached
human-level intelligence, only that it can mimic our outputs. Even
OpenAI's Sam Altman deemed it "a bad test" for these models.
This leaves us in a predicament. How do we understand the capabilities
and impacts of these models?
AI benchmarks - measurements used to evaluate the performance of
various AI models in a standardised manner - play a crucial role in this
understanding.
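In code terms, a benchmark is nothing more than a fixed set of inputs, reference answers, and one scoring rule applied identically to every model. A minimal sketch of that idea (the questions and the toy model below are invented for illustration, not drawn from any real benchmark):

```python
# A benchmark: fixed inputs, reference answers, and one scoring rule,
# so that every model is measured the same way. (Toy data for illustration.)
benchmark = [
    {"question": "2 + 2", "answer": "4"},
    {"question": "capital of France", "answer": "Paris"},
    {"question": "3 * 3", "answer": "9"},
]

def evaluate(model, benchmark) -> float:
    """Exact-match accuracy of `model` over the benchmark."""
    correct = sum(model(item["question"]) == item["answer"] for item in benchmark)
    return correct / len(benchmark)

# A hypothetical model that only handles arithmetic (eval is fine for a toy).
def toy_model(question: str) -> str:
    try:
        return str(eval(question))
    except Exception:
        return "unknown"

print(evaluate(toy_model, benchmark))  # 2 of 3 correct
```

Real benchmarks differ mainly in scale and scoring rule, not in shape: the fixed dataset and the shared metric are what make scores comparable across models.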
Unfortunately, existing benchmarks and evaluation techniques for AI
contain numerous flaws that have been exacerbated with the rise of LLMs.
In this series, we’ll explore the current state of AI evaluation and how
researchers are fixing it to ensure the safe and more measured
development of these models.
Before we move forward, I’d like to thank Exponential View members who
made themselves available to read early drafts and gave their input into
this first Chartpack. In particular, thanks to Ramsay Brown and Rafael
Kaufmann!
Garry Kasparov in his final match against Deep Blue, New York, 11 May 1997. Photograph: Stan Honda/AFP/Getty Images
Pawns of progress
By the 1980s, game playing, especially chess, had become a centrepiece of
AI research. Chess has long been viewed as a test of intelligence. With
well-defined rules and a finite but computationally complex structure3,
chess presented a challenging yet surmountable problem. The game's
quantitative rating system, Elo, served as a benchmark for AI researchers
to measure their models' progress over time. As models improved, they
climbed the Elo rankings, surpassing amateurs, professionals, and
eventually defeating world champion Garry Kasparov in 1997 - a landmark
in AI history.
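The Elo update itself is simple arithmetic: after each game, a player's rating moves in proportion to how much the result beats or falls short of its expected score, so an engine that upsets a stronger opponent climbs quickly. A minimal sketch (the K-factor of 32 and the ratings below are illustrative):

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32):
    """New ratings after one game; score_a is 1 (win), 0.5 (draw), or 0 (loss)."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - e_a))  # zero-sum update
    return new_a, new_b

# An underdog engine (2400) beating a champion (2800) gains far more
# rating than it would for beating a peer.
print(update_elo(2400, 2800, score_a=1.0))
```

Because the update is zero-sum and anchored to expected scores, a machine's climb up the rankings directly tracks its win rate against ever-stronger opposition - which is what made Elo usable as a progress benchmark.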
A byte-sized shift
Until the last couple of years, researchers tended to design AI systems to
excel at specific tasks such as playing chess, recognising speech, or
translating languages. These models, called narrow AI, were limited to the
tasks they were designed to perform.
However, the ultimate goal of the discipline since its inception in 1959 has
been to create an AI system that can generalise across tasks, mimic
human intelligence, and create new concepts. This is referred to as
artificial general intelligence (AGI).
The exact path to AGI was never clear. However, data-driven approaches -
training and improving AI models on large amounts of data - may offer a
way forward. In the 2000s, Microsoft researchers studied the factors
influencing AI system performance, particularly in natural language
disambiguation tasks4. Their findings revealed that the choice of model
mattered less to performance than the availability and quality of training
data. This insight spurred a shift in AI research towards data-driven
approaches.
The focus on data-driven approaches led to the development of large-
scale language models trained on vast amounts of data (e.g., GPT-3 was
trained on nearly a trillion words).
To capture the increasingly complex relationships within these datasets,
models required more parameters5, significantly increasing their size.
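To see why capturing richer relationships inflates model size so quickly, consider the parameter count of a plain fully-connected network: the weights grow with the product of adjacent layer widths, so doubling the width roughly quadruples the count. A back-of-the-envelope sketch (the layer sizes are illustrative, not those of any real model):

```python
def dense_param_count(layer_sizes: list) -> int:
    """Parameters in a fully-connected net: weights plus biases per layer."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# A two-layer net at width 1024 versus the same net at width 2048:
narrow = dense_param_count([1024, 1024, 1024])  # ~2.1M parameters
wide = dense_param_count([2048, 2048, 2048])    # ~8.4M: 2x width, ~4x params
print(narrow, wide)
```

LLMs use more elaborate architectures than this toy network, but the same multiplicative scaling is what pushes them into the billions of parameters.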
Benchmark-busting beasts
The pursuit of larger models has yielded impressive results, with some
even suggesting the recently released GPT-4 is an early version of AGI
(see Azeem Azhar discussing GPT-4 capabilities here). But it has also
introduced complications.
LLMs have become so complex that they are difficult to evaluate using
traditional AI benchmarks designed for narrow tasks. For instance, LLMs
can generate new code, critique arguments, and even understand images.
These capabilities are not evaluated in older benchmarks.
This has led to a surge of new natural language processing benchmarks
since 2014, as researchers seek more comprehensive measures.
"Benchmarks reporting SOTA" in the graph refers to the number of benchmarks on which a new state-of-the-art
(SOTA) performance - a new high score - was reported.
LLMs are considered general-purpose technologies with potentially
wide-reaching societal and economic ramifications. As a result, it is
essential to have the appropriate evaluative benchmarks to guide and
maintain control over their impact.
In next week’s Chartpack (for members only), we will explore the
challenges of evaluating LLMs and the potential societal
consequences if we fail to address them appropriately.
1. Nathan's research in this reminded me of Stephen Jay Gould's
Mismeasure of Man, a book I read nearly 40 years ago. Gould critiques
how measurement of human intelligence was misused to justify biological
determinism and social inequality. - Azeem
2. The Turing test evaluates a machine's ability to exhibit intelligent
behaviour equivalent to, or indistinguishable from, that of a human.
3. There are an estimated 10^43 board positions.
4. Natural language disambiguation is the process of determining the correct
contextual meaning of a word. For example, "bank" can mean either a
financial institution or the side of a river, depending on the context.
5. Parameters are the numerical values a model learns during training; they
determine how the model responds to a prompt, so changing the parameters
changes the response.