Over the past twelve months, OpenAI has consistently expanded its model zoo: separate models for code, for reasoning, for agents. GPT-5.4, released on March 5, 2026, breaks this logic — it unifies everything into a single model. But is this an architectural decision or an operational simplification?
Spoiler: it's both — and that's precisely why it's worth understanding the mechanics of this transition before deciding to use the model in production.
⚡ TLDR
- ✅ Consolidation: GPT-5.4 replaces both gpt-5.2 (general model) and gpt-5.3-codex (code model), unifying them into a single inference pipeline
- ✅ Reasoning as a parameter: instead of separate models — a single reasoning.effort parameter from none to xhigh, which controls the number of reasoning tokens
- ✅ CoT between turns: for the first time in a mainline model, chain-of-thought transfer between requests is supported via the Responses API — this reduces latency and increases cache hit rate
- 🎯 You will gain: an understanding of how architectural decisions in GPT-5.4 impact model selection, inference configuration, and cost in real-world scenarios
- 👇 Below — detailed explanations, examples, and tables
🎯 Why OpenAI Abandoned Separate Models: The Engineering Logic of Consolidation
Why consolidation?
OpenAI arrived at GPT-5.4 because of accumulated operational debt: maintaining separate code models (GPT-5-Codex → GPT-5.1-Codex → GPT-5.2-Codex → GPT-5.3-Codex) in parallel with general models (GPT-5.2, GPT-5.3 Instant) created fragmentation across the API, the documentation, and the choices facing developers. GPT-5.4 eliminates this fragmentation by combining the coding capabilities of GPT-5.3-Codex with general reasoning in a single model — the first mainline model where the merger happened at the training-stack level rather than in routing logic.
The choice between a "general" and a "code" model is cognitive overhead for the developer. The consolidated architecture shifts this choice from "which model to use" to "which parameter to set."
How the Model Zoo Looked Before GPT-5.4
To understand the logic of consolidation, one needs to reconstruct the chronology of fragmentation. Throughout
2025–2026, OpenAI simultaneously developed two independent tracks:
General track: GPT-5 (May 2025) → GPT-5.1 (August 2025) →
GPT-5.2 (December 2025) → GPT-5.3 Instant (February 2026). Each version had sub-variants:
Instant (low latency), Thinking (extended reasoning), Pro (maximum compute).
Codex track: GPT-5-Codex (May 2025) → GPT-5.1-Codex → GPT-5.1-Codex-Max
(with context compaction) → GPT-5.2-Codex (January 14, 2026) → GPT-5.3-Codex (February 2026).
Codex models were optimized for agentic coding: long task horizons, compaction,
SWE-bench-oriented training.
According to
OpenAI Model Release Notes, GPT-5.3-Codex was the first model where Codex and GPT-5
training stacks were unified: "the first model combining Codex + GPT-5 training
stacks — bringing together best-in-class code generation, reasoning, and general-purpose
intelligence in one unified model". GPT-5.4 completes this process,
becoming the default model for all surfaces: ChatGPT, API, and Codex simultaneously.
Specific Costs of Fragmentation for Developers
Fragmentation between tracks had real operational consequences. A developer building an agentic
workflow with mixed tasks (planning, code writing, document analysis, spreadsheet work)
faced several problems simultaneously.
Routing between models within the pipeline. The standard architecture included
GPT-5.2 for planning and reasoning steps and GPT-5.2-Codex for generation steps.
Each switch between models meant a separate API call, separate context management,
potentially different parameters, and different behavior in edge cases. For long agentic
trajectories, this accumulated into significant overhead — both in latency and code complexity.
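To make that overhead concrete, here is a minimal sketch of the two-model pattern (model IDs follow the article; the step structure and helper are illustrative, not a recommended design):

```python
from openai import OpenAI

client = OpenAI()

def run_step(model: str, messages: list) -> str:
    # Each hop is a separate API call carrying its own context payload.
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content

context = [{"role": "user", "content": "Plan, then implement, a log-rotation script."}]

# Planning/reasoning steps went to the general model...
plan = run_step("gpt-5.2", context)
context.append({"role": "assistant", "content": plan})
context.append({"role": "user", "content": "Implement the plan."})

# ...while generation steps were routed to the Codex model: a second
# endpoint, separate context management, separate edge-case behavior.
code = run_step("gpt-5.2-codex", context)
```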
Discrepancies in documentation and behavior.
Codex Models page
and the general API documentation were separate resources with different recommendations:
"Use GPT-5-codex for coding-focused work in Codex, or Codex-like environments;
use GPT-5 for general, non-coding tasks" — a direct instruction to the developer
to think about model selection at each step of the workflow.
Context compaction as an exclusive feature. GPT-5.1-Codex-Max introduced context
compaction — a mechanism for compressing long agentic contexts without losing key information.
However, this feature was only available in Codex models. Developers on general GPT-5.x
solved the context overflow problem with their own summarization mechanisms, which added
logic and tokens. GPT-5.4 introduces native compaction for all scenarios.
What "Merging Training Stacks" Means in Practice
According to
OpenAI's official announcement, GPT-5.4 is "the first mainline reasoning model that
incorporates the frontier coding capabilities of GPT-5.3-Codex". This is architecturally
different from a simple "better prompt" or post-hoc routing:
Shared weights for coding and reasoning. Both capability tracks live
in a single set of weights — the result of joint training on coding and general-purpose data,
not an ensemble or routing between separate models.
Unified context window for mixed tasks. When working with an agentic
workflow where planning (reasoning-heavy) and implementation (coding-heavy) steps alternate,
the context is not split across two models — a single model maintains the full task state.
Tool search as a bridge between capability tracks.
OpenAI
describes tool search as a mechanism that "helps agents find and use
the right tools more effectively without loss of intelligence" — this is possible precisely because
the model equally well understands both the general task context and the technical details
of the tools.
Deprecation Strategy as an Indicator of Priorities
The deprecation pattern confirms that consolidation is a strategic, not tactical, decision.
According to OpenAI's data,
GPT-5.2 Thinking will be available for another three months in the Legacy Models section and will be disabled
on June 6, 2026. Meanwhile, GPT-5.1, GPT-5, and GPT-4.1 in the API have not yet received a deprecation date
—
OpenAI explicitly stated: "no current plans to deprecate" for API versions.
This means: OpenAI deliberately keeps older models in the API for those who need lower
cost or specific behavior, but in the product layer (ChatGPT, Codex) consistently
transitions to a consolidated approach. For developers, this is a practical signal:
new integrations should be built for GPT-5.4, while legacy systems with GPT-5.1/4.1 can
be migrated without haste.
From "Which Model" to "Which Parameter"
Consolidation shifts complexity from model selection to parameter configuration. Instead of
the decision "GPT-5.2 or GPT-5.3-Codex," it's now the decision "which reasoning.effort
and whether I need the Responses API with CoT between turns."
This is not a simplification as such — the level of complexity remains, but changes form. The advantage:
configuration parameters are versioned, documented, and predictable within a single model ID.
Choosing between two different models with different training data and different behavior
on edge cases is significantly greater uncertainty than choosing between effort=high
and effort=xhigh within a single graph.
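As a sketch of the new shape of that decision — one endpoint plus parameters (the gpt-5.4 ID and the xhigh level follow this article; the call shape follows the current OpenAI Python client):

```python
from openai import OpenAI

client = OpenAI()

# One model ID for both capability tracks; the knob moves into parameters.
response = client.responses.create(
    model="gpt-5.4",
    reasoning={"effort": "high"},  # none | low | medium | high | xhigh
    input="Refactor this module and explain the architectural trade-offs.",
)
print(response.output_text)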
🎯 How the Reasoning Pipeline Changed from GPT-5.0 to 5.4: Router, Thinking Track, and Steerability
Evolution of Reasoning in GPT-5.x
The key change between GPT-5.x versions is the transition from a binary "reasoning on/off" to granular control via the reasoning.effort parameter with five levels: none, low, medium, high, xhigh. GPT-5.4 also introduces support for transferring chain-of-thought between turns via the Responses API, which fundamentally changes the model's behavior in multi-turn agentic scenarios.
Reasoning effort is not just "think more or less." It's direct control over the number of reasoning tokens the model generates before responding, with all the consequences for latency and cost.
The early GPT-5 architecture (version 5.0) had hard-coded logic: the model either entered reasoning mode or it didn't. GPT-5.1 introduced parametrization but stopped at three levels (low/medium/high). With GPT-5.2, the none level appeared for low-latency scenarios, along with xhigh for maximum depth — a new ceiling that became the standard mode for benchmarks in GPT-5.4.
According to OpenRouter, the reasoning-token budget by level is as follows: xhigh allocates up to 95% of max_tokens to reasoning, high — ~80%, medium — ~50%, low — ~20% (OpenRouter also documents a minimal level at ~10%, which GPT-5.4's five-level scheme does not expose). This means that with max_tokens=4096 and xhigh, the model generates up to ~3,891 hidden reasoning tokens before responding — tokens that are not displayed in the response but are charged as output tokens.
The fundamental innovation of GPT-5.4 is the support for transferring CoT (chain-of-thought) between turns via the Responses API. Previously, each request started reasoning from scratch. Now, the model can "remember" its previous thought process, which, according to OpenAI, results in improved reasoning quality, fewer generated reasoning tokens, a higher cache hit rate, and lower latency in multi-turn scenarios.
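A minimal sketch of how this chaining might look, assuming the Responses API's previous_response_id/store mechanism carries the reasoning state between turns as described:

```python
from openai import OpenAI

client = OpenAI()

# Turn 1: the model produces reasoning tokens and a response; store it server-side.
first = client.responses.create(
    model="gpt-5.4",
    reasoning={"effort": "high"},
    input="Analyze this failing test suite and propose a fix plan.",
    store=True,
)

# Turn 2: chain to the previous response so reasoning state carries over
# instead of being rebuilt from scratch — fewer reasoning tokens, better cache hits.
second = client.responses.create(
    model="gpt-5.4",
    reasoning={"effort": "high"},
    previous_response_id=first.id,
    input="Now implement step 1 of that plan.",
)
```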
🎯 GPT-5.4 Thinking as a Separate Inference Mode: How It Differs from the Base Model
Thinking is not a separate model
GPT-5.4 Thinking is not a separate model with different weights, but the same gpt-5.4 with a fixed
reasoning.effort=xhigh in the ChatGPT interface. At the API level, there is no difference between "Thinking" and the base
GPT-5.4 with the xhigh parameter — this is a product distinction for the end-user interface,
not an architectural one. GPT-5.4 Pro is the only variant that truly differs at the compute level:
it uses a wider inference graph and is priced separately.
Confusion between "Thinking," "Pro," and base GPT-5.4 is typical for those who look at the product
name rather than API parameters. At the inference level — it's one graph with different reasoning.effort
configurations and different allocated compute.
The GPT-5.4 product line in ChatGPT includes three modes: Instant
(effort=none/low — minimal latency), Thinking
(effort=xhigh — maximum reasoning depth), and Pro
(a separate inference graph with a larger compute budget per response). From an API perspective,
the first two are a single gpt-5.4 endpoint with different parameter values.
The difference between reasoning.effort levels is significant at the token economy level.
According to OpenRouter,
the distribution of reasoning tokens by level is as follows:
- xhigh — up to 95% of max_tokens is spent on hidden reasoning tokens
- high — ~80%
- medium — ~50%
- low — ~20%
- none — no reasoning tokens are generated
This means: with max_tokens=4096 and effort=xhigh, the model generates up to ~3,891 hidden reasoning tokens before the final response. These tokens are not displayed
in the response but are charged as output tokens — which directly affects the cost
and latency of the request.
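A back-of-the-envelope check of this token economy (the budget shares and the $60/1M output price come from this article; the helper itself is illustrative):

```python
# Hidden reasoning tokens are billed as output tokens, so the effort level
# directly scales cost even when the visible answer is short.
REASONING_SHARE = {"none": 0.0, "low": 0.20, "medium": 0.50, "high": 0.80, "xhigh": 0.95}
OUTPUT_PRICE_PER_1M = 60.0  # base gpt-5.4, $/1M output tokens (per this article)

def worst_case_output_cost(max_tokens: int, effort: str, visible_tokens: int) -> float:
    reasoning_tokens = int(max_tokens * REASONING_SHARE[effort])
    return (reasoning_tokens + visible_tokens) * OUTPUT_PRICE_PER_1M / 1_000_000

# max_tokens=4096, effort=xhigh: up to ~3,891 hidden tokens before a 200-token answer.
print(f"${worst_case_output_cost(4096, 'xhigh', 200):.4f}")  # ≈ $0.2455
```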
GPT-5.4 Pro is a special case. It differs not only in the number of
reasoning tokens but also in a wider inference graph: more beam search or an equivalent mechanism
at the sampling level. According to
Digital Applied,
GPT-5.4 Pro achieves 83.3% on ARC-AGI-2 — a benchmark for abstract reasoning,
where the base GPT-5.4 shows a lower result. Price: $30 input / $180 output per million tokens — i.e., 2× more expensive on input and 3× on output than the base configuration.
Practical recommendations (a selection sketch follows this list):
effort=none/low — real-time chat, classification, template-filling. TTFT < 1–2 s,
minimal cost.
effort=high — optimal balance for most agentic tasks: refactoring,
multi-step planning, structured output. Cost is 2–3× higher than none,
but quality is significantly better.
effort=xhigh — justified for one-off complex tasks: legal analysis,
architectural decisions, complex math reasoning. Maximum quality, but cost and latency
are commensurate.
Pro — critical scenarios with zero tolerance for errors.
ARC-AGI-2: 83.3% — currently the highest result among all public models.
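A minimal selection helper distilled from these recommendations (the task categories and the mapping are this article's heuristics, not an official API):

```python
# Illustrative mapping of the recommendations above to effort values.
def pick_effort(task: str) -> str:
    realtime = {"chat", "classification", "template_filling"}
    deep = {"legal_analysis", "architecture_review", "math_reasoning"}
    if task in realtime:
        return "none"   # TTFT-sensitive, minimal cost
    if task in deep:
        return "xhigh"  # one-off tasks where quality dominates cost
    return "high"       # default for agentic work: best balance

print(pick_effort("refactoring"))  # -> "high"
```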
The innovation of GPT-5.4, which affects all modes, is the support for transferring chain-of-thought between turns
via the Responses API. Previously, each request started reasoning from scratch; now the model retains
the previous thought-process between steps of an agentic workflow. This reduces the number of reasoning tokens
on repetitive steps and increases the cache hit rate, which directly impacts the cost of
multi-turn agentic scenarios.
📊 Comparison of GPT-5.4 / Claude Opus 4.6 / Gemini 3.1 Pro by Measurable Parameters
Data current as of March 2026. Sources:
Digital Applied,
Onyx LLM Leaderboard,
Failing Fast AI Coding Benchmarks.
| Parameter | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| GPQA Diamond (graduate science) | 93.2% | 91.3% | 94.3% |
| SWE-bench Verified (real GitHub issues) | 80.0% | 80.8% | ~74% |
| HumanEval (code generation) | 95.0% | 95.0% | ~97.8% |
| ARC-AGI-2 (abstract reasoning) | Pro: 83.3% / base: ~75% | ~72% | 77.1% |
| OSWorld (autonomous desktop tasks) | 75% (higher than human: 72.4%) | 72.7% | N/A |
| Aider Polyglot (multi-language code edit) | 88% | 70.7% | 83.1% |
| Context Window | 1M (API) / 272K (standard) | 200K (1M beta) | 2M |
| Input / Output ($/1M tokens) | ~$15 / $60 (base) · $30 / $180 (Pro) | $15 / $75 | $2 / $12 |
| Context caching | ❌ (tool search instead: −47% tokens) | ✅ (up to −75% input) | ✅ (up to −75% input) |
| Computer use (native) | ✅ (built-in) | ✅ | Limited |
| CoT between turns | ✅ (Responses API) | ❌ | ❌ |
Important: benchmarks measure model behavior under standardized conditions and do not always
correlate with production results. SWE-bench and OSWorld are the most representative
for real engineering tasks — these are the ones to focus on when choosing a model for
agentic workflows.
🎯 Consolidated Model as a Compromise: Latency, Specialization, and Maintenance Cost
What is the architectural compromise?
Consolidated architecture reduces operational complexity and improves cross-domain performance,
but increases the base cost of inference: the model is larger in terms of weights than its specialized predecessors.
However, the increased token efficiency of GPT-5.4 partially offsets the price difference — but only
for complex tasks. For simple high-throughput scenarios, the consolidated model remains
redundant, and a proper model hierarchy in the product remains important.
A consolidated model is the answer to the question "which model to use for complex mixed
tasks," but it does not negate the need to think about the model hierarchy in the product.
One large hammer is more convenient than a set of tools — until you need a scalpel.
Cost of Inference: What the Numbers Really Show
Comparing list prices does not reflect the real cost of inference in production — due to caching,
distribution of reasoning tokens, and tool search. Let's break down each factor separately.
Base token cost. GPT-5.4 and Claude Opus 4.6 are in the same
price segment (~$15/$60–75 per 1M). Gemini 3.1 Pro is significantly cheaper:
$2/$12 per 1M,
which gives approximately a 5–7× advantage in cost per token with comparable benchmark results.
For cost-sensitive scenarios — this is a significant parameter.
Context caching. Gemini 3.1 Pro and Claude Opus 4.6 support context caching
with input cost reduction up to 75%. GPT-5.4, however, uses
tool search — a mechanism for dynamic discovery and loading only
relevant tools. According to
Digital Applied,
tool search reduces token consumption by 47% for agentic tasks with a large
number of tools — this more than compensates for the lack of traditional caching
for relevant scenarios.
Token efficiency of GPT-5.4. According to
Augment Code,
for complex tasks (refactoring, multi-file reasoning, architectural planning), GPT-5.4
uses 18–20% fewer tokens than its predecessors. This means: a higher price per token is partially
offset by a lower number of tokens per task — but only for complex scenarios
where the model truly "thinks more efficiently."
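To see how token efficiency interacts with price per token, a quick illustrative calculation (prices from the table above; the task size and the assumption that the predecessor charged the same output rate are made up for the example):

```python
# Effective cost per task = output price × tokens actually spent.
# Suppose a complex refactoring task took 50K output tokens on a predecessor.
BASELINE_TOKENS = 50_000
GPT_54_TOKENS = int(BASELINE_TOKENS * 0.81)  # ~19% fewer tokens (article's 18–20%)

predecessor_cost = BASELINE_TOKENS * 60 / 1_000_000  # assumed same $60/1M output
gpt54_cost = GPT_54_TOKENS * 60 / 1_000_000
gemini_cost = BASELINE_TOKENS * 12 / 1_000_000       # Gemini 3.1 Pro at $12/1M

print(predecessor_cost, gpt54_cost, gemini_cost)  # -> 3.0 2.43 0.6
```

Even after the efficiency gain, Gemini's per-token price advantage dominates for tasks it handles equally well — which is why the hierarchy below still matters.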
Where the Consolidated Model is Not the Optimal Choice
For simple high-throughput tasks — classification, summarization, template-filling, intent
detection — consolidated GPT-5.4 is redundant. Here, the optimal choices remain:
gpt-5-mini — lightweight tasks with low latency and minimal cost.
Gemini 2.5 Flash ($0.30/$2.50 per 1M) — if scalability is needed
with a large context at minimal cost.
Claude Haiku 4.5 ($1.00/$5.00 per 1M) — for scenarios where Anthropic
API and low cost are important simultaneously.
Microsoft in its recommendations for
Azure Foundry
explicitly states: for real-time chat and customer support with low latency requirements, GPT-4.1
remains a better choice than GPT-5.x — despite lower reasoning quality.
GPT-4.1 latency for simple requests is significantly lower, and response quality for standard
support scenarios is sufficient.
Practical Model Hierarchy
| Level | Model | Scenario | Approximate Cost |
|---|---|---|---|
| High-frequency lightweight | gpt-5-mini / Gemini 2.5 Flash | Classification, summarization, chat | < $1/1M output |
| Complex reasoning + agents | gpt-5.4 (effort=high) | Refactoring, planning, multi-step agents | ~$60/1M output |
| Long-context document work | Gemini 3.1 Pro (2M ctx) | Analysis of large codebases, documents | ~$12/1M output |
| Critical one-shot precision | gpt-5.4 Pro / Claude Opus 4.6 | Legal analysis, financial decisions | $75–180/1M output |
The consolidated architecture of GPT-5.4 eliminates the need to choose between a "code" and a "general"
model within a complex workflow — but it does not eliminate the need to think about routing
between hierarchy levels. For a product engineer, the correct routing architecture between models
remains one of the key factors for optimizing cost and latency.
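A sketch of such tier routing (tiers and model IDs follow the table above; the thresholds are invented for the example):

```python
# Illustrative router over the hierarchy table above.
def route(task_type: str, context_tokens: int) -> str:
    if task_type in {"classification", "summarization", "chat"}:
        return "gpt-5-mini"        # high-frequency lightweight tier
    if context_tokens > 500_000:
        return "gemini-3.1-pro"    # long-context tier (2M window)
    if task_type in {"legal", "finance_critical"}:
        return "gpt-5.4-pro"       # zero-error-tolerance tier
    return "gpt-5.4"               # agentic default, effort=high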
❓ Frequently Asked Questions (FAQ)
Should I migrate from gpt-5.2 to gpt-5.4 right now?
It depends on the type of tasks. For agentic and coding scenarios, migration makes sense: GPT-5.4
combined the capabilities of GPT-5.3-Codex with general reasoning in one model, which means
better results on mixed workflows without routing between two endpoints. According to
VentureBeat,
GPT-5.4 generates 47% fewer tokens on some tasks compared to its predecessors —
and this partially compensates for the higher price per token for complex tasks.
For simple high-throughput tasks — classification, summarization, template-filling —
there is no need to migrate. gpt-5-mini or gpt-5.2 remain a more rational choice
in terms of cost/latency. GPT-5.2 Thinking remains in Legacy Models until June 6, 2026, meaning there is enough time for gradual migration. Meanwhile, GPT-5.1 and GPT-4.1
do not have an announced deprecation date in the API — so legacy systems do not need to be touched in a hurry.
How does gpt-5.4 in ChatGPT differ from gpt-5.4 in the API?
It's the same model, but with different levels of control over inference parameters.
In ChatGPT, routing between the three modes (Instant, Thinking, Pro) happens automatically —
according to OpenAI documentation, "routing layer selects the best model to use" based on
the complexity of the request, and the user can explicitly switch the mode via the UI.
In the API, you get direct control: a single endpoint gpt-5.4 and the parameter
reasoning.effort with five levels (none / low /
medium / high / xhigh). ChatGPT Instant corresponds to
approximately effort=none/low, ChatGPT Thinking — effort=xhigh.
GPT-5.4 Pro is the only exception: it uses a wider inference graph and is priced separately,
so it cannot be reduced simply to a parameter value.
When does the Responses API offer a real advantage over Chat Completions?
The Responses API is a significantly better choice for agentic and multi-turn reasoning scenarios.
The main difference is support for transferring chain-of-thought between turns: the model retains
the previous thought process and does not start reasoning from scratch at each step.
According to
OpenAI's internal eval data, this yields a 3% improvement on SWE-bench,
better cache hit rate, and lower latency in multi-turn scenarios, as well as
a 40–80% cost reduction due to improved caching compared to Chat Completions.
In addition to CoT, the Responses API provides exclusive access to tool_search (deferred tool loading),
native compaction, computer use, and stateful context via store: true.
For simple stateless requests without tools, Chat Completions remains a valid
choice — it is stable, well-documented, and not planned for sunset anytime soon.
The Assistants API, however, will be closed on August 26, 2026 — so migration
from it to the Responses API is a priority.
When does CoT transfer between turns not provide an advantage?
CoT persistence is only useful when the reasoning of one step actually affects the quality
of the next. In several common scenarios, this advantage is not realized.
Firstly, single-turn and stateless requests: if each request is contextually independent —
batch classification, mass text generation from a template, parallel independent tasks —
there is nothing to transfer CoT for, and the overhead from the Responses API is not justified.
Secondly, short simple chains with effort=none/low: reasoning tokens
are either not generated or are minimal — CoT persistence in this case does not yield a
measurable effect. Thirdly, scenarios with frequent context changes: if the task context
changes drastically between agent steps (e.g., separate independent subtasks),
preserved CoT may even degrade quality, as the model will "drag" irrelevant
reasoning from the previous step.
How does tool_search affect inference cost in real agentic scenarios?
Tool search is a mechanism that allows the model to defer a large surface of tools to runtime:
instead of passing the full list of available tools in each request, the model dynamically loads
only those relevant to the current step.
According to
OpenAI Changelog, this "reduces token usage, preserves cache performance, and improves latency."
According to VentureBeat,
overall savings on some agentic tasks reach 47% of tokens compared to predecessors.
Practically, this is important for systems with a large number of registered tools —
for example, corporate agents with dozens of MCP servers or API integrations.
Previously, passing the full tool manifest in each request significantly increased input tokens
and reduced cache hit rate, as the manifest changed between requests.
Tool search solves both problems simultaneously. It is available exclusively through the Responses API —
another argument in favor of migration for agentic scenarios.
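A sketch of the pattern — with the caveat that this article does not document the exact parameter shape, so the tools entry below is an assumption, not a confirmed API:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical: register a large tool surface once and let the model
# discover the tools relevant to each step, instead of receiving the
# full manifest on every request.
response = client.responses.create(
    model="gpt-5.4",
    tools=[{"type": "tool_search"}],  # assumed shape — not a documented parameter
    input="Find last quarter's invoices and reconcile them against the ledger.",
)
```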
✅ Conclusions
GPT-5.4 is not just a "better version" of its predecessor. It's a shift in architectural philosophy: from a set of specialized models to a unified parameterized inference pipeline. For developers, this means less cognitive load when choosing a model, but greater responsibility for correctly configuring parameters — specifically reasoning.effort and the choice between Chat Completions and Responses API.
Key architectural conclusions:
- Consolidation of gpt-5.2 and gpt-5.3-codex into one model reduces API fragmentation but increases the base cost of inference — optimization is needed at the parameter level, not endpoint selection
- CoT between turns via Responses API — the biggest hidden advantage for agentic systems, underestimated in most reviews
- "Thinking," "Pro," and base gpt-5.4 — these are one model graph with different inference configurations, not separate architectures
- The direction of GPT-5.x indicates the growing role of the Responses API as the primary protocol and the gradual deprecation of Chat Completions for complex scenarios
In the next articles of the series, we will delve into inference under the hood — how the 1M token context window is structured, where real bottlenecks arise, and how the cost comparison of GPT-5.4 with Claude Sonnet 4.6 and Gemini looks in measurable scenarios.
Sources:
OpenAI — Introducing GPT-5.4
OpenAI API — Using GPT-5.4
Augment Code — GPT-5.4 as default model
Microsoft Azure Foundry — GPT-5 vs GPT-4.1 model choice guide
OpenRouter — Reasoning Tokens documentation
The Neuron — GPT-5.4 full breakdown
Keywords: GPT-5.4, Codex, Gemini 3.1 Pro