On the Actual Capabilities of Large Language Models for
Producing Incremental Research Papers:
A Revised Assessment After Early GPT-5 Scientific Results
December 2025 Edition
December 12, 2025
Abstract
This updated paper revisits—and partially revises—an assessment written only a month
earlier evaluating the ability of Large Language Models (LLMs) to contribute to incremental
research in mathematics and theoretical physics. Substantial new evidence, particularly the
early scientific case studies of GPT-5 reported by OpenAI, demonstrates materially improved
capabilities relative to the assumptions underlying the previous analysis. This revision clarifies
which limitations of earlier models still apply, which have been weakened by the new results,
and what the new empirical frontier implies for cautious, well-structured human–LLM scientific
collaboration. The tone remains conservative: the improvements are significant, but they do not
eliminate the need for expert oversight, rigorous verification, or principled scientific governance.
1 Introduction
A month ago, the conservative view—including my own—was that LLMs could not reliably contribute to incremental mathematical or theoretical-physics research in the way human researchers
do. The earlier assessment argued that LLMs produced plausible mathematical text, not verified
argumentation; that they failed to maintain global logical coherence; and that any appearance of
progress was an artifact of pattern-matching, not reasoning.
Between that assessment and this revision, the release of the GPT-5 Early Science Acceleration
report and associated experiments has meaningfully shifted the empirical foundation. The results
do not overturn the fundamental epistemic concerns, but they do show that the earlier paper was
too conservative in several specific claims—particularly regarding:
• the ability to generate new correct proofs,
• the ability to perform deep semantic literature search,
• the ability to identify non-obvious mathematical or physical structures,
• and the ability to execute mechanistic biological reasoning grounded in data.
This new edition provides an updated and more accurate characterization of what LLMs (specifically GPT-5-class systems) can presently achieve, emphasizing rigor, verification, and boundaries of validity. The early science acceleration results are documented in the OpenAI report, available at https://cdn.openai.com/pdf/4a25f921-e4e0-479a-9b38-5367b47e8fd0/early-science-acceleration-experiments-with-gpt-5.pdf.
2 What Incremental Papers Require (Unchanged Core Requirements)
Incremental research papers typically involve:
1. Improving constants or tightening inequalities,
2. Extending known theorems to broader classes,
3. Identifying new symmetry structures,
4. Constructing new proofs of existing results,
5. Developing or analyzing Lagrangians, functionals, or equations,
6. Providing mechanistic or causal explanations in scientific domains,
7. Verifying that results are novel and not pre-empted by prior literature.
These requirements remain unchanged, and they remain non-negotiable. A scientifically valid
incremental paper must be correct, reproducible, and precisely located within the existing literature.
3 What the Earlier Assessment Claimed—and What Has Now
Changed
3.1 Claim 1 (Now Revised): LLMs cannot produce new correct proofs or improve bounds
Revised assessment: GPT-5 has now produced multiple independently verified novel or semi-
novel mathematical results. The most striking example is an experiment in which GPT-5 improved
a step-size constant from 1/L to 1.5/L in a nontrivial convex optimization context, producing a
correct and previously unpublished intermediate proof (see the GPT-5 Early Science Acceleration
report).
This contradicts the earlier claim that LLMs cannot sustain globally valid derivations. The new
result shows that, under sustained internal reasoning with expert scaffolding and human vetting of
final steps, GPT-5 can perform authentic, nontrivial mathematical argumentation.
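To make concrete what a step-size constant of this kind means, the following is a minimal numerical sketch of the textbook setting in which such results live: gradient descent on an L-smooth convex objective, where 1/L is the classical step size and values such as 1.5/L are the kind of larger constant at issue. The quadratic objective, its dimension, and the code itself are illustrative assumptions and do not reproduce the specific problem studied in the report.

import numpy as np

# Illustrative L-smooth convex objective: f(x) = 0.5 * x^T A x with A positive semidefinite.
# L, the Lipschitz constant of the gradient, is the largest eigenvalue of A.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M.T @ M
L = float(np.linalg.eigvalsh(A).max())

def gradient_descent(step, iters=200):
    # Plain gradient descent from a fixed starting point; grad f(x) = A x.
    x = np.ones(5)
    for _ in range(iters):
        x = x - step * (A @ x)
    return 0.5 * x @ A @ x  # final objective value

# Compare the classical step size 1/L with the larger constant 1.5/L discussed above.
print("f after 200 steps with step 1/L  :", gradient_descent(1.0 / L))
print("f after 200 steps with step 1.5/L:", gradient_descent(1.5 / L))

On this simple quadratic both runs converge, since any fixed step strictly between 0 and 2/L does; the substance of the reported result lies in establishing what the larger constant achieves in a far less forgiving setting.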
3.2 Claim 2 (Now Revised): LLMs cannot discover deep structural insights
(e.g., symmetries)
Revised assessment: Experiments documented in the GPT-5 Early Science Acceleration report
show that GPT-5, after warm-up on related differential equations, reproduced the full SL(2, R)
symmetry generators of a curved-space PDE governing black hole tidal response. Producing such
a nontrivial structural result directly contradicts the earlier assumption that LLMs lack the ability
to identify invariant structures in advanced mathematics or physics.
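For readers unfamiliar with what reproducing such generators entails, the algebraic target is the standard commutation relations of the SL(2, R) Lie algebra, which any valid set of generators must close into:

[L_m, L_n] = (m - n) L_{m+n},   m, n ∈ {-1, 0, +1},

so that in particular [L_1, L_{-1}] = 2 L_0 and [L_0, L_{±1}] = ∓ L_{±1}. This display is only the abstract algebra; the explicit differential-operator realization for the black hole tidal-response equation is constructed in the report and is not reproduced here.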
3.3 Claim 3 (Now Revised): LLMs cannot perform deep literature discovery
Revised assessment: The report demonstrates that GPT-5 can locate obscure mathematical
results from mid-20th-century literature, including sources in German and nonstandard mathematical vocabularies, effectively unifying semantically equivalent formulations across decades of
research. This exceeds the retrieval horizon assumed in the earlier paper and suggests LLMs may
now serve as semantic unifiers of mathematical literature. However, hallucination risk remains;
human confirmation is still required.
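As a rough illustration of the retrieval idea behind such semantic unification, matching formulations by meaning rather than by exact wording, the sketch below ranks a handful of toy abstracts against a query by vector similarity. It uses TF-IDF purely as a self-contained stand-in for the neural embedding models such systems actually rely on; the abstracts, the query, and the pipeline are hypothetical and are not the mechanism documented in the report.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for abstracts scattered across decades and languages (all hypothetical).
abstracts = [
    "Eine obere Schranke fuer die Norm der Resolvente sektorieller Operatoren",
    "A bound on the resolvent norm for sectorial operators on Banach spaces",
    "Counting lattice points in convex bodies via harmonic analysis",
]
query = "resolvent estimates for sectorial operators"

# Embed the query and the corpus in one vector space and rank by cosine similarity.
vectorizer = TfidfVectorizer().fit(abstracts + [query])
scores = cosine_similarity(vectorizer.transform([query]), vectorizer.transform(abstracts))[0]
for score, text in sorted(zip(scores, abstracts), reverse=True):
    print(f"{score:.2f}  {text}")

The German entry scores near zero here, which is exactly the lexical-matching limitation that neural embeddings, and evidently GPT-5's retrieval behavior, are able to overcome.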
3.4 Claim 4 (Now Revised): LLMs can only produce narrative gloss in biology
Revised assessment: GPT-5’s analysis of unpublished flow cytometry datasets identified the correct mechanism underlying Th17 skewing caused by impaired N-linked glycosylation, predicted the outcomes of mannose rescue experiments, and integrated multiple signaling pathways into a coherent causal explanation that matched later experimental outcomes. This indicates that LLMs can now support mechanistic biological reasoning grounded in real datasets—not merely generate plausible explanatory prose.
4 What Has Not Changed: Structural Limitations That Still Hold
Despite substantial gains, several aspects of the earlier analysis remain true:
• Verification is still essential—LLMs occasionally produce subtle errors or overconfident assertions.
• Reproducibility is imperfect—identical prompts may lead to different internal reasoning paths.
• The formalization gap persists—LLMs do not yet simultaneously produce verifiable Lean/Coq proofs alongside informal argumentation (a minimal Lean illustration follows this list).
• Autonomous research remains out of reach—LLMs do not independently choose scientifically
valuable problems or reliably evaluate novelty without human guidance.
• Scientific epistemology has not changed—falsifiability, reproducibility, and empirical validation remain core to valid scientific claims.
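To fix ideas about the formal side of that gap, the following is a minimal Lean 4 example of the kind of machine-checkable artifact that would ideally accompany an informal derivation; the statement is deliberately elementary and is not drawn from any result discussed above.

-- A machine-checked statement and proof: the sum of two even natural numbers is even.
-- (Deliberately trivial; shown only to illustrate what a formal artifact looks like.)
theorem even_add_even (a b : Nat)
    (ha : ∃ k, a = 2 * k) (hb : ∃ k, b = 2 * k) :
    ∃ k, a + b = 2 * k :=
  match ha, hb with
  | ⟨i, hi⟩, ⟨j, hj⟩ => ⟨i + j, by omega⟩

The informal sentence and the checked theorem assert the same fact; the gap noted above is that current models rarely deliver both forms together for research-level arguments.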
5 Updated Model of LLM–Human Collaboration (2025)
Given the new empirical evidence, a more accurate characterization is:
LLMs can now play the role of structured research companions, capable of producing candidate
proofs, mechanistic hypotheses, literature integrations, and data interpretations that may reach co-
author–level significance, provided human experts verify and refine the results.
This represents a meaningful shift from the earlier framing that limited LLMs to a drafting
assistant role.
6 Conservative Outlook: What Comes Next
The following developments are plausible but not guaranteed. They represent conservative,
evidence-based extrapolations rather than speculative leaps:
• Tight human–LLM iterative mathematics: Workflows in which humans guide LLMs through multiple derivation paths, select candidates, and use symbolic verification pipelines (a minimal sketch follows this list), thereby compressing weeks of work into hours.
• Formal–informal dual proofs: Systems that generate both human-readable proofs and
formal Lean/Isabelle artifacts side by side.
• Semantic “living literature” systems: LLMs that maintain continuously updated maps
of the literature to detect prior results, equivalent formulations, and potential gaps.
• Mechanism-discovery assistants in biology and physics: LLMs that integrate data
with mechanistic explanations, propose new experimental branches, and interpret multi-
dimensional datasets, always with human oversight.
• Autonomous self-checking modules: Models equipped with adversarial reasoning loops,
counterexample generation, and uncertainty quantification to reduce hallucination risks and
improve reliability.
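As a tiny concrete instance of the symbolic verification pipelines mentioned in the first bullet above, the sketch below uses SymPy to check a single candidate derivation step mechanically rather than by eye. The particular identity and the choice of SymPy are illustrative assumptions; any computer-algebra or proof-checking layer could play the same role.

import sympy as sp

x = sp.symbols("x")

# Candidate step (e.g., proposed by a model inside a longer derivation):
# the claim that d/dx [ x * exp(-x**2 / 2) ] equals (1 - x**2) * exp(-x**2 / 2).
claimed = (1 - x**2) * sp.exp(-x**2 / 2)
computed = sp.diff(x * sp.exp(-x**2 / 2), x)

# Mechanical check: the residual must simplify to zero; a human inspects only failures.
residual = sp.simplify(computed - claimed)
print("step verified:", residual == 0)

In a full workflow the same pattern would be applied step by step across a derivation, with any nonzero residual routed back to the human or the model for repair.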
7 Conclusion
The earlier assessment—written only weeks before the release of compelling GPT-5 scientific results—was accurate for GPT-4–class systems but overly conservative for GPT-5. The advances
of late 2025 show that LLMs can contribute substantively to incremental research under guided
conditions.
However, the epistemic core remains:
LLMs amplify scientific reasoning; they do not replace scientific rigor. They propose; humans
verify. They accelerate; humans validate.
The future of scientific research will increasingly be co-creative, with LLMs serving as structured collaborators that extend human reach—but never eliminate the indispensable role of expert
judgment, formal verification, and empirical grounding.