12/11/23, 10:40 AM
Chartpack: Measuring AI (1/3)
https://www.exponentialview.co/p/chartpack-measuring-ai-1
Chartpack: Measuring AI (1/3)
The mismeasure of AI: How it began
Azeem Azhar
Hi,
Azeem here. We’re introducing Chartpacks, a new format for
investigating the questions we care about through quant and qual
assessment. Each Chartpack will explore a particular exponential thesis
over three to four weeks.
The first part of each Chartpack will be available to all recipients of the
newsletter. The subsequent parts will be sent to the paying members of
Exponential View.
We’re aiming to produce 13-15 of these a year.
In the first Chartpack, EV team member Nathan Warren explores how
the way we evaluate AI systems has changed and the challenges
posed to it by large language models like ChatGPT.
You can find part 2 and part 3 here.
Part 1 | The mismeasure of AI: How it began
What if the way we evaluate artificial intelligence was flawed? [1] The rapid
rise of ChatGPT and other large language models (LLMs) has left us
struggling to understand where we stand in the AI landscape. Old
standards, like the problematic Turing Test [2], are no longer relevant, with
GPT-4's output already being (mostly) indistinguishable from human-made
text. However, this doesn't mean that it has reached human-level
intelligence, only that it can mimic our outputs. Even OpenAI’s Sam Altman
deemed the Turing Test "a bad test" for these models.
This leaves us in a predicament. How do we understand the capabilities
and impacts of these models?
AI benchmarks - measurements used to evaluate the performance of
various AI models in a standardised manner - play a crucial role in this
understanding.
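
To make the idea of a standardised measurement concrete, here is a minimal sketch, assuming a benchmark is simply a fixed set of question-answer pairs plus one scoring rule applied identically to every model. The toy dataset, the two stand-in "models" and the function names are all hypothetical and are not drawn from any real benchmark.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (question, expected answer)

# Hypothetical toy dataset: every model sees exactly the same examples.
DATASET: List[Example] = [("2+2", "4"), ("3+5", "8"), ("1+3", "4"), ("7+2", "9")]


def always_four(question: str) -> str:
    """Stand-in model that ignores the question."""
    return "4"


def adder(question: str) -> str:
    """Stand-in model that actually adds the two numbers."""
    left, right = question.split("+")
    return str(int(left) + int(right))


def accuracy(model: Callable[[str], str], dataset: List[Example]) -> float:
    """Fraction of examples where the model's answer matches exactly."""
    correct = sum(1 for question, answer in dataset if model(question) == answer)
    return correct / len(dataset)


def run_benchmark(models: Dict[str, Callable[[str], str]],
                  dataset: List[Example]) -> Dict[str, float]:
    """Score every model on the same data with the same metric."""
    return {name: accuracy(model, dataset) for name, model in models.items()}


scores = run_benchmark({"always_four": always_four, "adder": adder}, DATASET)
```

Because the data and the metric are held fixed, the two scores are directly comparable. Note that the model that ignores the question still gets half the answers right, a small reminder that a score only means as much as the test behind it.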
Unfortunately, existing benchmarks and evaluation techniques for AI
contain numerous flaws that have been exacerbated by the rise of LLMs.
In this series, we’ll explore the current state of AI evaluation and how
researchers are fixing it to ensure the safe and more measured
development of these models.
Before we move forward, I’d like to thank Exponential View members who
made themselves available to read early drafts and gave their input into
this first Chartpack. In particular, thanks to Ramsay Brown and Rafael
Kaufmann!
Garry Kasparov in his final match against Deep Blue, New York, 11 May 1997. Photograph: Stan Honda/AFP/Getty Images
Pawns of progress
By the 1980s, game playing, especially chess, had become a centrepiece of
AI research. Chess has long been viewed as a test of intelligence. With
well-defined rules and a finite but computationally complex structure [3],
chess presented a challenging yet surmountable problem. The game’s
quantitative rating system, Elo, served as a benchmark for AI researchers
to measure their models’ progress over time. As models improved, they
climbed the Elo rankings, surpassing amateurs and then professionals, and
eventually defeating world champion Garry Kasparov in 1997 - a landmark
in AI history.
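
For readers curious how an Elo rating turns game results into a single number, the sketch below shows the standard Elo expected-score formula and rating update. The K-factor of 32 and the ratings used in the example are illustrative assumptions, not figures from any chess engine's history.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (between 0 and 1) of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_rating(rating: float, expected: float, actual: float,
                  k: float = 32.0) -> float:
    """New rating after one game.

    actual is 1.0 for a win, 0.5 for a draw and 0.0 for a loss.
    k (the K-factor) sets how far a single result moves the rating;
    32 is an assumed value chosen for illustration.
    """
    return rating + k * (actual - expected)


# Hypothetical example: a 2400-rated engine beats a 2600-rated opponent.
engine, opponent = 2400.0, 2600.0
engine_after = update_rating(engine, expected_score(engine, opponent), actual=1.0)
```

An upset win against a stronger opponent moves the rating more than a win against a weaker one, which is why a steadily climbing rating was a convenient way to track progress over time.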
A byte-sized shift
Until the last couple of years, researchers tended to design AI systems to
excel at specific tasks such as playing chess, recognising speech, or
translating languages. These models, called narrow AI, were limited to the
tasks they were designed to perform.
However, the ultimate goal of the discipline since its inception in 1956 has
been to create an AI system that can generalise across tasks, mimic
human intelligence, and create new concepts. This is referred to as
artificial general intelligence (AGI).
The exact path to AGI was never clear. However, we may have found a
way to make progress through data-driven approaches - using large
amounts of data to train and improve AI models. In the 2000s, Microsoft
researchers studied the factors influencing AI system performance, particularly in
natural language disambiguation tasks [4]. Their findings revealed that the
type of model used mattered less to performance than the
availability and quality of training data. This insight spurred a shift in AI
research towards data-driven approaches.
The focus on data-driven approaches led to the development of large-scale
language models trained on vast amounts of data (e.g., GPT-3 was
trained on nearly a trillion words).
To capture the increasingly complex relationships within these datasets,
models required more parameters [5], significantly increasing their size.
Benchmark-busting beasts
The pursuit of larger models has yielded impressive results, with some
even suggesting the recently released GPT-4 is an early version of AGI
(see Azeem Azhar discussing GPT-4 capabilities here). But it has also introduced
complications.
LLMs have become so complex that they are difficult to evaluate using
traditional AI benchmarks designed for narrow tasks. For instance, LLMs
can generate new code, critique arguments, and even understand images.
These capabilities are not evaluated in older benchmarks.
This has led to a surge in new natural language processing benchmarks since
2014, as researchers seek more comprehensive measures.
"Benchmarks reporting SOTA" in the graph refers to the number of benchmarks reporting a new state-of-the-art
(SOTA) performance - a new high score.
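
As a rough illustration of how a "benchmarks reporting SOTA" count could be computed, the sketch below tallies, for each year, how many distinct benchmarks saw a new high score. The record format, the sample data and the choice to count a benchmark's first reported result as a new high score are all assumptions made for illustration; the graph's actual methodology is not described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Result = Tuple[str, int, float]  # (benchmark name, year, score); higher is better


def count_sota_benchmarks(results: Iterable[Result]) -> Dict[int, int]:
    """Number of distinct benchmarks with a new high score in each year."""
    best: Dict[str, float] = {}
    sota_by_year: Dict[int, Set[str]] = defaultdict(set)
    # Process results in chronological order so "best so far" is meaningful.
    for benchmark, year, score in sorted(results, key=lambda r: r[1]):
        # Assumption: the first result on a benchmark counts as a new high score.
        if benchmark not in best or score > best[benchmark]:
            best[benchmark] = score
            sota_by_year[year].add(benchmark)
    return {year: len(names) for year, names in sorted(sota_by_year.items())}


# Hypothetical records, not real benchmark data.
sample = [
    ("A", 2014, 0.60),
    ("A", 2015, 0.70),
    ("B", 2015, 0.50),
    ("A", 2016, 0.65),  # below A's previous best, so not a new high score
    ("B", 2016, 0.80),
]
counts = count_sota_benchmarks(sample)
```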
LLMs are considered general-purpose technologies with potentially wide-reaching
societal and economic ramifications. As a result, it is essential to
have the appropriate evaluative benchmarks to guide and maintain control
over their impact.
In next week’s Chartpack (for members only), we will explore the
challenges of evaluating LLMs and the potential societal
consequences if we fail to address them appropriately.
[1] Nathan’s research in this reminded me of Stephen Jay Gould’s
The Mismeasure of Man, a book I read nearly 40 years ago. Gould critiques
how measurement of human intelligence was misused to justify biological
determinism and social inequality. - Azeem
[2] The Turing test evaluates a machine’s ability to exhibit intelligent
behaviour equivalent to, or indistinguishable from, that of a human.
[3] There are an estimated 10^43 board positions.
[4] Natural language disambiguation is the process of determining the correct
contextual meaning of a word. For example, “bank” can mean either a
financial institution or the side of a river, depending on the context.
[5] Parameters are the numerical values a model learns during training. They
control how the model responds to a prompt, so changing the parameters
changes the response.
12/11/23, 10:40 AM
Chartpack: Measuring AI (1/3)
Page 10 of 10
https://www.exponentialview.co/p/chartpack-measuring-ai-1

More Related Content

Similar to The mismeasuring of AI: How it all began

IRJET- Machine Learning: Introduction, Algorithms and Implementation
IRJET-  	  Machine Learning: Introduction, Algorithms and ImplementationIRJET-  	  Machine Learning: Introduction, Algorithms and Implementation
IRJET- Machine Learning: Introduction, Algorithms and ImplementationIRJET Journal
 
Hybrid use of machine learning and ontology
Hybrid use of machine learning and ontologyHybrid use of machine learning and ontology
Hybrid use of machine learning and ontologyAnthony (Tony) Sarris
 
Copy of State of AI Report 2023 - ONLINE.pptx
Copy of State of AI Report 2023 - ONLINE.pptxCopy of State of AI Report 2023 - ONLINE.pptx
Copy of State of AI Report 2023 - ONLINE.pptxmpower4ru
 
State of AI Report 2023 - ONLINE presentation
State of AI Report 2023 - ONLINE presentationState of AI Report 2023 - ONLINE presentation
State of AI Report 2023 - ONLINE presentationssuser2750ef
 
Review on Algorithmic and Non Algorithmic Software Cost Estimation Techniques
Review on Algorithmic and Non Algorithmic Software Cost Estimation TechniquesReview on Algorithmic and Non Algorithmic Software Cost Estimation Techniques
Review on Algorithmic and Non Algorithmic Software Cost Estimation Techniquesijtsrd
 
ai_and_you_slide_template.pptx
ai_and_you_slide_template.pptxai_and_you_slide_template.pptx
ai_and_you_slide_template.pptxganeshjilo
 
SE and AI: a two-way street
SE and AI: a two-way streetSE and AI: a two-way street
SE and AI: a two-way streetCS, NcState
 
Artificial Intelligence power point presentation document
Artificial Intelligence power point presentation documentArtificial Intelligence power point presentation document
Artificial Intelligence power point presentation documentDavid Raj Kanthi
 
SEMANTIC NETWORKS IN AI
SEMANTIC NETWORKS IN AISEMANTIC NETWORKS IN AI
SEMANTIC NETWORKS IN AIIRJET Journal
 
Analysing Chatgpt’s Potential Through the Lens of Creating Research Papers
Analysing Chatgpt’s Potential Through the Lens of Creating Research PapersAnalysing Chatgpt’s Potential Through the Lens of Creating Research Papers
Analysing Chatgpt’s Potential Through the Lens of Creating Research PapersAIRCC Publishing Corporation
 
Analysing Chatgpt’s Potential Through the Lens of Creating Research Papers
Analysing Chatgpt’s Potential Through the Lens of Creating Research PapersAnalysing Chatgpt’s Potential Through the Lens of Creating Research Papers
Analysing Chatgpt’s Potential Through the Lens of Creating Research PapersAIRCC Publishing Corporation
 
Machine Learning on Big Data with HADOOP
Machine Learning on Big Data with HADOOPMachine Learning on Big Data with HADOOP
Machine Learning on Big Data with HADOOPEPAM Systems
 
The Ultimate Guide to Machine Learning (ML)
The Ultimate Guide to Machine Learning (ML)The Ultimate Guide to Machine Learning (ML)
The Ultimate Guide to Machine Learning (ML)RR IT Zone
 
Innovation at the Edge_Final
Innovation at the Edge_FinalInnovation at the Edge_Final
Innovation at the Edge_FinalChris Waller
 
Pistoia Alliance US Conference 2015 - 1.1.2 Innovation in Pharma - Chris Waller
Pistoia Alliance US Conference 2015 - 1.1.2 Innovation in Pharma - Chris WallerPistoia Alliance US Conference 2015 - 1.1.2 Innovation in Pharma - Chris Waller
Pistoia Alliance US Conference 2015 - 1.1.2 Innovation in Pharma - Chris WallerPistoia Alliance
 
MITIGATION TECHNIQUES TO OVERCOME DATA HARM IN MODEL BUILDING FOR ML
MITIGATION TECHNIQUES TO OVERCOME DATA HARM IN MODEL BUILDING FOR MLMITIGATION TECHNIQUES TO OVERCOME DATA HARM IN MODEL BUILDING FOR ML
MITIGATION TECHNIQUES TO OVERCOME DATA HARM IN MODEL BUILDING FOR MLijaia
 
IRJET- Machine Learning
IRJET- Machine LearningIRJET- Machine Learning
IRJET- Machine LearningIRJET Journal
 
How economists should think about the revolutionary changes taking place in h...
How economists should think about the revolutionary changes taking place in h...How economists should think about the revolutionary changes taking place in h...
How economists should think about the revolutionary changes taking place in h...Economic Strategy Institute
 

Similar to The mismeasuring of AI: How it all began (20)

IRJET- Machine Learning: Introduction, Algorithms and Implementation
IRJET-  	  Machine Learning: Introduction, Algorithms and ImplementationIRJET-  	  Machine Learning: Introduction, Algorithms and Implementation
IRJET- Machine Learning: Introduction, Algorithms and Implementation
 
Hybrid use of machine learning and ontology
Hybrid use of machine learning and ontologyHybrid use of machine learning and ontology
Hybrid use of machine learning and ontology
 
Copy of State of AI Report 2023 - ONLINE.pptx
Copy of State of AI Report 2023 - ONLINE.pptxCopy of State of AI Report 2023 - ONLINE.pptx
Copy of State of AI Report 2023 - ONLINE.pptx
 
State of AI Report 2023 - ONLINE presentation
State of AI Report 2023 - ONLINE presentationState of AI Report 2023 - ONLINE presentation
State of AI Report 2023 - ONLINE presentation
 
Review on Algorithmic and Non Algorithmic Software Cost Estimation Techniques
Review on Algorithmic and Non Algorithmic Software Cost Estimation TechniquesReview on Algorithmic and Non Algorithmic Software Cost Estimation Techniques
Review on Algorithmic and Non Algorithmic Software Cost Estimation Techniques
 
ai_and_you_slide_template.pptx
ai_and_you_slide_template.pptxai_and_you_slide_template.pptx
ai_and_you_slide_template.pptx
 
SE and AI: a two-way street
SE and AI: a two-way streetSE and AI: a two-way street
SE and AI: a two-way street
 
Technovision
TechnovisionTechnovision
Technovision
 
Artificial Intelligence power point presentation document
Artificial Intelligence power point presentation documentArtificial Intelligence power point presentation document
Artificial Intelligence power point presentation document
 
Final Project
Final ProjectFinal Project
Final Project
 
SEMANTIC NETWORKS IN AI
SEMANTIC NETWORKS IN AISEMANTIC NETWORKS IN AI
SEMANTIC NETWORKS IN AI
 
Analysing Chatgpt’s Potential Through the Lens of Creating Research Papers
Analysing Chatgpt’s Potential Through the Lens of Creating Research PapersAnalysing Chatgpt’s Potential Through the Lens of Creating Research Papers
Analysing Chatgpt’s Potential Through the Lens of Creating Research Papers
 
Analysing Chatgpt’s Potential Through the Lens of Creating Research Papers
Analysing Chatgpt’s Potential Through the Lens of Creating Research PapersAnalysing Chatgpt’s Potential Through the Lens of Creating Research Papers
Analysing Chatgpt’s Potential Through the Lens of Creating Research Papers
 
Machine Learning on Big Data with HADOOP
Machine Learning on Big Data with HADOOPMachine Learning on Big Data with HADOOP
Machine Learning on Big Data with HADOOP
 
The Ultimate Guide to Machine Learning (ML)
The Ultimate Guide to Machine Learning (ML)The Ultimate Guide to Machine Learning (ML)
The Ultimate Guide to Machine Learning (ML)
 
Innovation at the Edge_Final
Innovation at the Edge_FinalInnovation at the Edge_Final
Innovation at the Edge_Final
 
Pistoia Alliance US Conference 2015 - 1.1.2 Innovation in Pharma - Chris Waller
Pistoia Alliance US Conference 2015 - 1.1.2 Innovation in Pharma - Chris WallerPistoia Alliance US Conference 2015 - 1.1.2 Innovation in Pharma - Chris Waller
Pistoia Alliance US Conference 2015 - 1.1.2 Innovation in Pharma - Chris Waller
 
MITIGATION TECHNIQUES TO OVERCOME DATA HARM IN MODEL BUILDING FOR ML
MITIGATION TECHNIQUES TO OVERCOME DATA HARM IN MODEL BUILDING FOR MLMITIGATION TECHNIQUES TO OVERCOME DATA HARM IN MODEL BUILDING FOR ML
MITIGATION TECHNIQUES TO OVERCOME DATA HARM IN MODEL BUILDING FOR ML
 
IRJET- Machine Learning
IRJET- Machine LearningIRJET- Machine Learning
IRJET- Machine Learning
 
How economists should think about the revolutionary changes taking place in h...
How economists should think about the revolutionary changes taking place in h...How economists should think about the revolutionary changes taking place in h...
How economists should think about the revolutionary changes taking place in h...
 

More from LUMINATIVE MEDIA/PROJECT COUNSEL MEDIA GROUP

Who’s Responsible for the Gaza Hospital Explosion? Here’s Why It’s Hard to Know
Who’s Responsible for the Gaza Hospital Explosion? Here’s Why It’s Hard to KnowWho’s Responsible for the Gaza Hospital Explosion? Here’s Why It’s Hard to Know
Who’s Responsible for the Gaza Hospital Explosion? Here’s Why It’s Hard to KnowLUMINATIVE MEDIA/PROJECT COUNSEL MEDIA GROUP
 

More from LUMINATIVE MEDIA/PROJECT COUNSEL MEDIA GROUP (20)

A.I. Has a Measurement Problem: Can It Be Solved?
A.I. Has a Measurement Problem: Can It Be Solved?A.I. Has a Measurement Problem: Can It Be Solved?
A.I. Has a Measurement Problem: Can It Be Solved?
 
Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?
 
How Tech Giants Cut Corners to Harvest Data for A.I.
How Tech Giants Cut Corners to Harvest Data for A.I.How Tech Giants Cut Corners to Harvest Data for A.I.
How Tech Giants Cut Corners to Harvest Data for A.I.
 
Why democracy dies in Trumpian boredom (by Edward Luce)
Why democracy dies in Trumpian boredom (by Edward Luce)Why democracy dies in Trumpian boredom (by Edward Luce)
Why democracy dies in Trumpian boredom (by Edward Luce)
 
An interview with the director of "Zone of Interest"
An interview with the director of "Zone of Interest"An interview with the director of "Zone of Interest"
An interview with the director of "Zone of Interest"
 
"Schindler’s List" : an oral history with the actors
"Schindler’s List" : an oral history with the actors"Schindler’s List" : an oral history with the actors
"Schindler’s List" : an oral history with the actors
 
Chinese Startup 01.AI Is Winning the Open Source AI Race
Chinese Startup 01.AI Is Winning the Open Source AI RaceChinese Startup 01.AI Is Winning the Open Source AI Race
Chinese Startup 01.AI Is Winning the Open Source AI Race
 
Google’s Gemini Marketing Trick: what a trickster!
Google’s Gemini Marketing Trick: what a trickster!Google’s Gemini Marketing Trick: what a trickster!
Google’s Gemini Marketing Trick: what a trickster!
 
Inside the Magical World of AI Prompters on Reddit
Inside the Magical World of AI Prompters on RedditInside the Magical World of AI Prompters on Reddit
Inside the Magical World of AI Prompters on Reddit
 
Regulators blame Bezos for making Amazon worse in new lawsuit details
Regulators blame Bezos for making Amazon worse in new lawsuit detailsRegulators blame Bezos for making Amazon worse in new lawsuit details
Regulators blame Bezos for making Amazon worse in new lawsuit details
 
Bariatric Surgery at 16
Bariatric Surgery at 16Bariatric Surgery at 16
Bariatric Surgery at 16
 
Palestinians Claim Social Media 'Censorship' Is Endangering Lives
Palestinians Claim Social Media 'Censorship' Is Endangering LivesPalestinians Claim Social Media 'Censorship' Is Endangering Lives
Palestinians Claim Social Media 'Censorship' Is Endangering Lives
 
Who’s Responsible for the Gaza Hospital Explosion? Here’s Why It’s Hard to Know
Who’s Responsible for the Gaza Hospital Explosion? Here’s Why It’s Hard to KnowWho’s Responsible for the Gaza Hospital Explosion? Here’s Why It’s Hard to Know
Who’s Responsible for the Gaza Hospital Explosion? Here’s Why It’s Hard to Know
 
Why ChatGPT Is Getting Dumber at Basic Math
Why ChatGPT Is Getting Dumber at Basic MathWhy ChatGPT Is Getting Dumber at Basic Math
Why ChatGPT Is Getting Dumber at Basic Math
 
U.S. and E.U. Finalize Long-Awaited Deal on Sharing Data
U.S. and E.U. Finalize Long-Awaited Deal on Sharing DataU.S. and E.U. Finalize Long-Awaited Deal on Sharing Data
U.S. and E.U. Finalize Long-Awaited Deal on Sharing Data
 
Will A.I. Become the New McKinsey?
Will A.I. Become the New McKinsey?Will A.I. Become the New McKinsey?
Will A.I. Become the New McKinsey?
 
AI is already writing books, websites and online recipes
AI is already writing books, websites and online recipesAI is already writing books, websites and online recipes
AI is already writing books, websites and online recipes
 
What happens when ChatGPT lies about real people?
What happens when ChatGPT lies about real people?What happens when ChatGPT lies about real people?
What happens when ChatGPT lies about real people?
 
The Brilliant Inventor Who Made Two of History’s Biggest Mistakes
The Brilliant Inventor Who Made Two of History’s Biggest MistakesThe Brilliant Inventor Who Made Two of History’s Biggest Mistakes
The Brilliant Inventor Who Made Two of History’s Biggest Mistakes
 
Wirecard fraudster Jan Marsalek’s grandfather was suspected Russian spy
Wirecard fraudster Jan Marsalek’s grandfather was suspected Russian spyWirecard fraudster Jan Marsalek’s grandfather was suspected Russian spy
Wirecard fraudster Jan Marsalek’s grandfather was suspected Russian spy
 

Recently uploaded

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Recently uploaded (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

The mismeasuring of AI: How it all began

  • 1. 12/11/23, 10:40 AM Chartpack: Measuring AI (1/3) Page 1 of 10 https://www.exponentialview.co/p/chartpack-measuring-ai-1 Chartpack: Measuring AI (1/3) The mismeasure of AI: How it began Azeem Azhar Hi, Azeem here. We’re introducing Chartpacks, a new format for investigating the questions we care about through quant and qual assessment. Each Chartpack will explore a particular exponential thesis over three to four weeks. The first part of each Chartpack will be available to all recipients of the newsletter. The subsequent parts will be sent to the paying members of Exponential View. We’re aiming to produce 13-15 of these a year. In the first Chartpack, EV team member Nathan Warren explores how the way we evaluate AI systems has changed and the challenges posed to it by large language models like ChatGPT. You can find part 2 and part 3 here. Part 1 | The mismeasure of AI: How it began What if the way we evaluate artificial intelligence was flawed? [1] The rapid rise of ChatGPT and other large language models (LLMs) has left us struggling to understand where we stand in the AI landscape. Old standards, like the problematic Turing Test [2], are no longer relevant: GPT-4's output is already (mostly) indistinguishable from human-made text.
  • 2. However, this doesn't mean that GPT-4 has reached human-level intelligence, only that it can mimic our outputs. Even OpenAI’s Sam Altman deemed the Turing Test "a bad test" for these models. This leaves us in a predicament. How do we understand the capabilities and impacts of these models? AI benchmarks - measurements used to evaluate the performance of various AI models in a standardised manner - play a crucial role in this understanding. Unfortunately, existing benchmarks and evaluation techniques for AI contain numerous flaws that have been exacerbated by the rise of LLMs. In this series, we’ll explore the current state of AI evaluation and how researchers are fixing it to ensure the safe and more measured development of these models. Before we move forward, I’d like to thank Exponential View members who made themselves available to read early drafts and gave their input into this first Chartpack. In particular, thanks to Ramsay Brown and Rafael Kaufmann!
  • 3. [Photo: Garry Kasparov in his final match against Deep Blue, New York, 11 May 1997. Photograph: Stan Honda/AFP/Getty Images] Pawns of progress By the 1980s, game playing, especially chess, had become a centrepiece of AI research. Chess has long been viewed as a test of intelligence. With well-defined rules and a finite but computationally complex structure [3], chess presented a challenging yet surmountable problem. The game’s quantitative rating system, Elo, served as a benchmark for AI researchers to measure their models’ progress over time. As models improved, they climbed the Elo rankings, surpassing amateurs, then professionals, and eventually defeating world champion Garry Kasparov in 1997 - a landmark in AI history.
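To make the idea of Elo as a benchmark concrete, here is a minimal sketch of the standard Elo expected-score and rating-update formulas. The K-factor of 32 and the ratings used in the example are illustrative assumptions, not figures from the article or from any real match.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (between 0 and 1) for player A against player B.

    Uses the standard Elo logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_rating(rating: float, expected: float, actual: float,
                  k: float = 32.0) -> float:
    """New rating after one game.

    actual is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    k is the K-factor; 32 is an assumed, commonly used value.
    """
    return rating + k * (actual - expected)


if __name__ == "__main__":
    # Hypothetical ratings, chosen only to illustrate the arithmetic.
    engine, human = 1900.0, 1500.0
    e = expected_score(engine, human)
    # A 400-point gap gives an expected score of about 0.91.
    print(f"expected score for the engine: {e:.2f}")
    print(f"engine rating after a win: {update_rating(engine, e, 1.0):.1f}")
```

Because each game moves a rating only by the gap between the actual and the expected result, a steadily rising rating against progressively stronger opponents gives one comparable number for tracking progress over time.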
  • 4. A byte-sized shift Until the last couple of years, researchers tended to design AI systems to excel at specific tasks such as playing chess, recognising speech, or translating languages. These models, called narrow AI, were limited to the tasks they were designed to perform. However, the ultimate goal of the discipline since its inception in 1959 has been to create an AI system that can generalise across tasks, mimic human intelligence, and create new concepts. This is referred to as artificial general intelligence (AGI).
  • 5. The exact path to AGI was never clear. However, we may have found a way to make progress through data-driven approaches: using large amounts of data to train and improve AI models. In the 2000s, Microsoft researched factors influencing AI system performance, particularly in natural language disambiguation tasks [4]. Their findings revealed that the type of model used was less of a factor in performance than the availability and quality of training data. This insight spurred a shift in AI research towards data-driven approaches.
  • 6. The focus on data-driven approaches led to the development of large-scale language models trained on vast amounts of data (e.g., GPT-3 was trained on nearly a trillion words). To capture the increasingly complex relationships within these datasets, models required more parameters [5], significantly increasing their size.
  • 7. Benchmark-busting beasts The pursuit of larger models has yielded impressive results, with some even suggesting the recently released GPT-4 is an early version of AGI (see Azeem Azhar discussing GPT-4 capabilities here). But it has also introduced complications. LLMs have become so complex that they are difficult to evaluate using traditional AI benchmarks designed for narrow tasks.
  • 8. For instance, LLMs can generate new code, critique arguments, and even understand images. These capabilities are not evaluated in older benchmarks. This has led to a surge in new natural language processing benchmarks since 2014 as researchers seek more comprehensive measures. "Benchmarks reporting SOTA" in the graph refers to the number of benchmarks reporting a new state-of-the-art (SOTA) performance - a new high score. LLMs are considered general-purpose technologies with potentially wide-reaching societal and economic ramifications.
  • 9. As a result, it is essential to have the appropriate evaluative benchmarks to guide and maintain control over their impact. In next week’s Chartpack (for members only), we will explore the challenges of evaluating LLMs and the potential societal consequences if we fail to address them appropriately.
    [1] Nathan’s research in this reminded me of Stephen Jay Gould’s The Mismeasure of Man, a book I read nearly 40 years ago. Gould critiques how measurement of human intelligence was misused to justify biological determinism and social inequality. - Azeem
    [2] The Turing test evaluates a machine’s ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human.
    [3] There are an estimated 10^43 board positions.
    [4] Natural language disambiguation is the process of determining the correct contextual meaning of a word. For example, “bank” can mean either a financial institution or the side of a river, depending on the context.
    [5] Parameters control how a model responds to a prompt; change a parameter and you change the response.