GPT-5.4 in 2026: From Specialized Models to Consolidated Architecture — What Has Changed


Over the past twelve months, OpenAI has consistently expanded its model zoo: separate models for code, for reasoning, for agents. GPT-5.4, released on March 5, 2026, breaks this logic — it unifies everything into a single model. But is this an architectural decision or an operational simplification?

Spoiler: it's both — and that's precisely why it's worth understanding the mechanics of this transition before deciding to use the model in production.

⚡ TLDR

  • Consolidation: GPT-5.4 replaces both gpt-5.2 (general model) and gpt-5.3-codex (code model), unifying them into a single inference pipeline
  • Reasoning as a parameter: instead of separate models — a single reasoning.effort parameter from none to xhigh, which controls the number of reasoning tokens
  • CoT between turns: for the first time in a mainline model, chain-of-thought transfer between requests is supported via the Responses API — this reduces latency and increases cache hit rate
  • 🎯 You will gain: an understanding of how architectural decisions in GPT-5.4 impact model selection, inference configuration, and cost in real-world scenarios
  • 👇 Below — detailed explanations, examples, and tables


🎯 Why OpenAI Abandoned Separate Models: The Engineering Logic of Consolidation

Why consolidation?

OpenAI arrived at GPT-5.4 under the weight of accumulated operational debt: maintaining separate code models (GPT-5-Codex → GPT-5.1-Codex → GPT-5.2-Codex → GPT-5.3-Codex) in parallel with general models (GPT-5.2, GPT-5.3 Instant) created fragmentation in the API, in the documentation, and in the choices developers had to make. GPT-5.4 eliminates this fragmentation by combining the coding capabilities of GPT-5.3-Codex with general reasoning in a single model — the first mainline model where the merger happened at the training-stack level, not in routing logic.

The choice between a "general" and a "code" model is cognitive overhead for the developer. A consolidated architecture shifts this choice from "which model to use" to "which parameter to set."

How the Model Zoo Looked Before GPT-5.4

To understand the logic of consolidation, one needs to reconstruct the chronology of fragmentation. Throughout 2025–2026, OpenAI simultaneously developed two independent tracks:

General track: GPT-5 (May 2025) → GPT-5.1 (August 2025) → GPT-5.2 (December 2025) → GPT-5.3 Instant (February 2026). Each version had sub-variants: Instant (low latency), Thinking (extended reasoning), Pro (maximum compute).

Codex track: GPT-5-Codex (May 2025) → GPT-5.1-Codex → GPT-5.1-Codex-Max (with context compaction) → GPT-5.2-Codex (January 14, 2026) → GPT-5.3-Codex (February 2026). Codex models were optimized for agentic coding: long task horizons, compaction, SWE-bench-oriented training.

According to OpenAI Model Release Notes, GPT-5.3-Codex was the first model where the Codex and GPT-5 training stacks were unified: "the first model combining Codex + GPT-5 training stacks — bringing together best-in-class code generation, reasoning, and general-purpose intelligence in one unified model". GPT-5.4 completes this process, becoming the default model for all surfaces: ChatGPT, API, and Codex simultaneously.

Specific Costs of Fragmentation for Developers

Fragmentation between tracks had real operational consequences. A developer building an agentic workflow with mixed tasks (planning, code writing, document analysis, spreadsheet work) faced several problems simultaneously.

Routing between models within the pipeline. The standard architecture used GPT-5.2 for planning and reasoning steps and GPT-5.2-Codex for generation steps. Each switch between models meant a separate API call, separate context management, potentially different parameters, and different behavior in edge cases. For long agentic trajectories, this accumulated into significant overhead — in both latency and code complexity.

Discrepancies in documentation and behavior. The Codex Models page and the general API documentation were separate resources with different recommendations: "Use GPT-5-codex for coding-focused work in Codex, or Codex-like environments; use GPT-5 for general, non-coding tasks" — a direct instruction to the developer to think about model selection at each step of the workflow.

Context compaction as an exclusive feature. GPT-5.1-Codex-Max introduced context compaction — a mechanism for compressing long agentic contexts without losing key information. However, this feature was only available in Codex models. Developers on general GPT-5.x solved context overflow with their own summarization mechanisms, which added logic and tokens. GPT-5.4 introduces native compaction for all scenarios.

What "Merging Training Stacks" Means in Practice

According to OpenAI's official announcement, GPT-5.4 is "the first mainline reasoning model that incorporates the frontier coding capabilities of GPT-5.3-Codex". This is architecturally different from a simple "better prompt" or post-hoc routing:

  • Shared weights for coding and reasoning. Both capability tracks live in a single set of weights — the result of joint training on coding and general-purpose data, not an ensemble or routing between separate models.
  • Unified context window for mixed tasks. In an agentic workflow where planning (reasoning-heavy) and implementation (coding-heavy) steps alternate, the context is not split between models — the model maintains the full task state.
  • Tool search as a bridge between capability tracks. OpenAI describes tool search as a mechanism that "helps agents find and use the right tools more effectively without loss of intelligence" — possible precisely because the model understands both the general task context and the technical details of the tools equally well.

Deprecation Strategy as an Indicator of Priorities

The deprecation pattern confirms that consolidation is a strategic, not tactical, decision. According to OpenAI's data, GPT-5.2 Thinking will remain available for another three months in the Legacy Models section and will be disabled on June 6, 2026. Meanwhile, GPT-5.1, GPT-5, and GPT-4.1 in the API have not yet received a deprecation date; OpenAI explicitly stated "no current plans to deprecate" for the API versions.

This means OpenAI deliberately keeps older models in the API for those who need lower cost or specific behavior, while the product layer (ChatGPT, Codex) consistently moves to the consolidated approach. For developers, this is a practical signal: build new integrations on GPT-5.4, and migrate legacy systems on GPT-5.1/4.1 without haste.

From "Which Model" to "Which Parameter"

Consolidation shifts complexity from model selection to parameter configuration. Instead of deciding "GPT-5.2 or GPT-5.3-Codex," the decision is now "which reasoning.effort, and do I need the Responses API with CoT between turns."

This is not a simplification as such — the complexity remains, but changes form. The advantage: configuration parameters are versioned, documented, and predictable within a single model ID. Choosing between two different models with different training data and different edge-case behavior carries significantly more uncertainty than choosing between effort=high and effort=xhigh within a single graph.

🎯 How the Reasoning Pipeline Changed from GPT-5.0 to 5.4: Router, Thinking Track, and Steerability

Evolution of Reasoning in GPT-5.x

The key change between GPT-5.x versions is the transition from a binary "reasoning on/off" to granular control via the reasoning.effort parameter with five levels: none, low, medium, high, xhigh. GPT-5.4 also introduces support for transferring chain-of-thought between turns via the Responses API, which fundamentally changes the model's behavior in multi-turn agentic scenarios.

Reasoning effort is not just "think more or less." It's direct control over the number of reasoning tokens the model generates before responding, with all the consequences for latency and cost.

The early GPT-5 architecture (version 5.0) had hard-coded logic: the model either entered reasoning mode or it didn't. GPT-5.1 introduced parametrization with three levels (low/medium/high). GPT-5.2 added the none level for low-latency scenarios and xhigh for maximum depth — a new ceiling that became the standard mode for benchmarks in GPT-5.4.

According to OpenRouter, the allocation of reasoning tokens by level is as follows: xhigh allocates up to 95% of max_tokens to reasoning, high — 80%, medium — 50%, low — 20%, none — 0%. This means that with max_tokens=4096 and xhigh, the model generates up to ~3,891 hidden reasoning tokens before responding — tokens that are not displayed in the response but are billed as output tokens.
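The allocation rule above can be sketched as a small lookup. A minimal sketch, assuming the approximate OpenRouter coefficients quoted in this section; the exact allocation is a provider implementation detail, so treat the numbers as illustrative rather than contractual:

```python
# Reasoning-token budget per effort level, using the approximate
# coefficients quoted above (illustrative, not contractual).
EFFORT_RATIO = {
    "none": 0.0,    # no reasoning tokens generated
    "low": 0.20,
    "medium": 0.50,
    "high": 0.80,
    "xhigh": 0.95,  # almost the whole budget goes to hidden reasoning
}

def reasoning_budget(max_tokens: int, effort: str) -> int:
    """Upper bound on hidden reasoning tokens for a given effort level."""
    return int(max_tokens * EFFORT_RATIO[effort])

print(reasoning_budget(4096, "xhigh"))  # → 3891
print(reasoning_budget(4096, "none"))   # → 0
```

This is why raising effort silently shrinks the budget left for the visible answer: the reasoning share comes out of the same max_tokens pool.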

The fundamental innovation of GPT-5.4 is the support for transferring CoT (chain-of-thought) between turns via the Responses API. Previously, each request started reasoning from scratch. Now, the model can "remember" its previous thought process, which, according to OpenAI, results in improved reasoning quality, fewer generated reasoning tokens, a higher cache hit rate, and lower latency in multi-turn scenarios.


🎯 GPT-5.4 Thinking as a Separate Inference Mode: How It Differs from the Base Model

Thinking is not a separate model

GPT-5.4 Thinking is not a separate model with different weights, but the same gpt-5.4 with a fixed reasoning.effort=xhigh in the ChatGPT interface. At the API level, there is no difference between "Thinking" and base GPT-5.4 with the xhigh parameter — this is a product distinction for the end-user interface, not an architectural one. GPT-5.4 Pro is the only variant that truly differs at the compute level: it uses a wider inference graph and is priced separately.

Confusion between "Thinking," "Pro," and base GPT-5.4 is typical for those who look at the product name rather than the API parameters. At the inference level, it's one graph with different reasoning.effort configurations and different allocated compute.

The GPT-5.4 product line in ChatGPT includes three modes: Instant (effort=none/low — minimal latency), Thinking (effort=xhigh — maximum reasoning depth), and Pro (a separate inference graph with a larger compute budget per response). From an API perspective, the first two are a single gpt-5.4 endpoint with different parameter values.

The difference between reasoning.effort levels matters at the level of token economics. According to OpenRouter, the distribution of reasoning tokens by level is as follows:

  • xhigh — up to 95% of max_tokens is spent on hidden reasoning tokens
  • high — ~80%
  • medium — ~50%
  • low — ~20%
  • none — no reasoning tokens are generated

This means that with max_tokens=4096 and effort=xhigh, the model generates up to ~3,891 hidden reasoning tokens before the final response. These tokens are not displayed in the response but are billed as output tokens — which directly affects the cost and latency of the request.
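The billing impact is easy to quantify. A back-of-the-envelope sketch, assuming the ~$60 per 1M output tokens base price cited elsewhere in this article; reasoning tokens are billed as output even though they never appear in the response:

```python
# Cost of hidden reasoning tokens at an assumed output price.
OUTPUT_PRICE_PER_M = 60.0  # USD per 1M output tokens (assumed base price)

def hidden_reasoning_cost(reasoning_tokens: int) -> float:
    """Dollar cost of reasoning tokens billed as output."""
    return reasoning_tokens * OUTPUT_PRICE_PER_M / 1_000_000

# ~3,890 hidden tokens at effort=xhigh with max_tokens=4096:
print(f"${hidden_reasoning_cost(3890):.4f}")  # → $0.2334
```

Roughly 23 cents of invisible output per maximally-thoughtful request — negligible for a one-off analysis, material for a high-volume pipeline.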

GPT-5.4 Pro is a special case. It differs not only in the number of reasoning tokens but also in a wider inference graph: more beam search or an equivalent mechanism at the sampling level. According to Digital Applied, GPT-5.4 Pro achieves 83.3% on ARC-AGI-2 — a benchmark for abstract reasoning where base GPT-5.4 scores lower. Price: $30 input / $180 output per million tokens — i.e., 2–3× more expensive per token than the base configuration.

Practical recommendation:

  • effort=none/low — real-time chat, classification, template-filling. TTFT < 1–2 s, minimal cost.
  • effort=high — optimal balance for most agentic tasks: refactoring, multi-step planning, structured output. Cost is 2–3× higher than none, but quality is significantly better.
  • effort=xhigh — justified for one-off complex tasks: legal analysis, architectural decisions, complex mathematical reasoning. Maximum quality, but with cost and latency to match.
  • Pro — critical scenarios with zero tolerance for errors. ARC-AGI-2: 83.3% — currently the highest result among all public models.

The GPT-5.4 innovation that affects all modes is support for transferring chain-of-thought between turns via the Responses API. Previously, each request started reasoning from scratch; now the model retains the previous thought process between steps of an agentic workflow. This reduces the number of reasoning tokens on repetitive steps and increases the cache hit rate, which directly impacts the cost of multi-turn agentic scenarios.

📊 Comparison of GPT-5.4 / Claude Opus 4.6 / Gemini 3.1 Pro by Measurable Parameters

Data current as of March 2026. Sources: Digital Applied, Onyx LLM Leaderboard, Failing Fast AI Coding Benchmarks.

| Parameter | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| GPQA Diamond (graduate science) | 93.2% | 91.3% | 94.3% |
| SWE-bench Verified (real GitHub issues) | 80.0% | 80.8% | ~74% |
| HumanEval (code generation) | 95.0% | 95.0% | ~97.8% |
| ARC-AGI-2 (abstract reasoning) | Pro: 83.3% / base: ~75% | ~72% | 77.1% |
| OSWorld (autonomous desktop tasks) | 75% (higher than human: 72.4%) | 72.7% | N/A |
| Aider Polyglot (multi-language code edit) | 88% | 70.7% | 83.1% |
| Context window | 1M (API) / 272K (standard) | 200K (1M beta) | 2M |
| Input / Output ($/1M tokens) | ~$15 / $60 (base) · $30 / $180 (Pro) | $15 / $75 | $2 / $12 |
| Context caching | Tool search (−47% tokens) | ✅ (up to −75% input) | ✅ (up to −75% input) |
| Computer use (native) | ✅ (built-in) | Limited | — |
| CoT between turns | ✅ (Responses API) | — | — |

Important: benchmarks measure model behavior under standardized conditions and do not always correlate with production results. SWE-bench and OSWorld are the most representative for real engineering tasks — focus on these when choosing a model for agentic workflows.

🎯 Consolidated Model as a Compromise: Latency, Specialization, and Maintenance Cost

What is the architectural compromise?

A consolidated architecture reduces operational complexity and improves cross-domain performance, but increases the base cost of inference: the model is larger in weights than its specialized predecessors. The increased token efficiency of GPT-5.4 partially offsets the price difference — but only for complex tasks. For simple high-throughput scenarios, the consolidated model remains overkill, and a proper model hierarchy in the product remains important.

A consolidated model answers the question "which model to use for complex mixed tasks," but it does not remove the need to think about the model hierarchy in the product. One large hammer is more convenient than a set of tools — until you need a scalpel.

Cost of Inference: What the Numbers Really Show

Comparing list prices does not reflect the real cost of inference in production — because of caching, the distribution of reasoning tokens, and tool search. Let's break down each factor separately.

Base token cost. GPT-5.4 and Claude Opus 4.6 sit in the same price segment (~$15/$60–75 per 1M). Gemini 3.1 Pro is significantly cheaper: $2/$12 per 1M, which gives roughly a 5–7× advantage in cost per token with comparable benchmark results. For cost-sensitive scenarios, this is a significant parameter.
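The 5–7× figure is easy to reproduce from the list prices. A quick sketch using the prices quoted above and an assumed 80/20 input/output token mix — the ratio shifts with the mix, which is why the article gives a range rather than a single number:

```python
# Blended cost per request mix at list prices quoted in the article.
def blended_cost(in_price: float, out_price: float,
                 in_tokens: int, out_tokens: int) -> float:
    """Dollar cost for a given token mix at $/1M list prices."""
    return (in_price * in_tokens + out_price * out_tokens) / 1_000_000

# Assumed workload: 0.8M input / 0.2M output tokens.
gpt54 = blended_cost(15, 60, 800_000, 200_000)
gemini = blended_cost(2, 12, 800_000, 200_000)

print(round(gpt54 / gemini, 1))  # → 6.0
```

At this mix GPT-5.4 costs $24.00 versus $4.00 for Gemini 3.1 Pro — a 6× gap, squarely inside the 5–7× range cited above.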

Context caching. Gemini 3.1 Pro and Claude Opus 4.6 support context caching with input cost reduction of up to 75%. GPT-5.4 instead uses tool search — a mechanism for dynamically discovering and loading only the relevant tools. According to Digital Applied, tool search reduces token consumption by 47% for agentic tasks with a large number of tools — which more than compensates for the lack of traditional caching in the relevant scenarios.

Token efficiency of GPT-5.4. According to Augment Code, for complex tasks (refactoring, multi-file reasoning, architectural planning), GPT-5.4 uses 18–20% fewer tokens than its predecessors. A higher price per token is thus partially offset by fewer tokens per task — but only for complex scenarios where the model genuinely "thinks more efficiently."

Where the Consolidated Model is Not the Optimal Choice

For simple high-throughput tasks — classification, summarization, template-filling, intent detection — consolidated GPT-5.4 is overkill. Here, the optimal choices remain:

  • gpt-5-mini — lightweight tasks with low latency and minimal cost.
  • Gemini 2.5 Flash ($0.30/$2.50 per 1M) — when you need scalability with a large context at minimal cost.
  • Claude Haiku 4.5 ($1.00/$5.00 per 1M) — for scenarios where the Anthropic API and low cost matter simultaneously.

In its recommendations for Azure Foundry, Microsoft explicitly states that for real-time chat and customer support with low latency requirements, GPT-4.1 remains a better choice than GPT-5.x — despite lower reasoning quality. GPT-4.1 latency for simple requests is significantly lower, and response quality for standard support scenarios is sufficient.

Practical Model Hierarchy

| Level | Model | Scenario | Approximate cost |
|---|---|---|---|
| High-frequency lightweight | gpt-5-mini / Gemini 2.5 Flash | Classification, summarization, chat | < $1/1M output |
| Complex reasoning + agents | gpt-5.4 (effort=high) | Refactoring, planning, multi-step agents | ~$60/1M output |
| Long-context document work | Gemini 3.1 Pro (2M ctx) | Analysis of large codebases, documents | ~$12/1M output |
| Critical one-shot precision | gpt-5.4 Pro / Claude Opus 4.6 | Legal analysis, financial decisions | $75–180/1M output |
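The hierarchy above reduces to a small, explicit routing policy. A minimal sketch: model names and thresholds are illustrative assumptions, and the 272K cutoff mirrors the standard GPT-5.4 window cited in the comparison table; the point is that routing logic moves from "which model family" to a compact policy function:

```python
# Illustrative router over the model hierarchy described above.
# Model identifiers and thresholds are assumptions, not official names.
def route(task_kind: str, context_tokens: int = 0) -> dict:
    if task_kind in {"classify", "summarize", "chat"}:
        # high-frequency lightweight tier
        return {"model": "gpt-5-mini"}
    if context_tokens > 272_000:
        # beyond the standard window, fall back to a 2M-context model
        return {"model": "gemini-3.1-pro"}
    if task_kind == "critical":
        # zero-tolerance one-shot precision tier
        return {"model": "gpt-5.4-pro"}
    # default: consolidated model with high reasoning effort
    return {"model": "gpt-5.4", "reasoning": {"effort": "high"}}

print(route("classify"))                               # → {'model': 'gpt-5-mini'}
print(route("refactor")["model"])                      # → gpt-5.4
print(route("refactor", context_tokens=500_000)["model"])  # → gemini-3.1-pro
```

The policy is deliberately boring: versioned, testable, and cheap to audit — which is exactly the property the consolidated architecture makes possible.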

The consolidated architecture of GPT-5.4 eliminates the need to choose between a "code" and a "general" model within a complex workflow — but not the need to think about routing between hierarchy levels. For a product engineer, a correct routing architecture between models remains one of the key levers for optimizing cost and latency.


❓ Frequently Asked Questions (FAQ)

Should I migrate from gpt-5.2 to gpt-5.4 right now?

It depends on the type of tasks. For agentic and coding scenarios, migration makes sense: GPT-5.4 combines the capabilities of GPT-5.3-Codex with general reasoning in one model, which means better results on mixed workflows without routing between two endpoints. According to VentureBeat, GPT-5.4 generates 47% fewer tokens on some tasks compared to its predecessors — which partially compensates for the higher price per token on complex tasks.

For simple high-throughput tasks — classification, summarization, template-filling — there is no need to migrate: gpt-5-mini or gpt-5.2 remain the more rational choice in terms of cost and latency. GPT-5.2 Thinking remains in Legacy Models until June 6, 2026, leaving enough time for gradual migration. Meanwhile, GPT-5.1 and GPT-4.1 have no announced deprecation date in the API — so legacy systems need not be touched in a hurry.

How does gpt-5.4 in ChatGPT differ from gpt-5.4 in the API?

It's the same model, but with different levels of control over inference parameters. In ChatGPT, routing between the three modes (Instant, Thinking, Pro) happens automatically — per OpenAI documentation, the "routing layer selects the best model to use" based on the complexity of the request, and the user can explicitly switch modes via the UI.

In the API, you get direct control: a single gpt-5.4 endpoint and the reasoning.effort parameter with five levels (none / low / medium / high / xhigh). ChatGPT Instant corresponds approximately to effort=none/low, ChatGPT Thinking to effort=xhigh. GPT-5.4 Pro is the only exception: it uses a wider inference graph and is priced separately, so it cannot be reduced to a mere parameter value.

When does the Responses API offer a real advantage over Chat Completions?

The Responses API is the significantly better choice for agentic and multi-turn reasoning scenarios. The main difference is support for transferring chain-of-thought between turns: the model retains its previous thought process and does not start reasoning from scratch at each step. According to OpenAI's internal eval data, this yields a 3% improvement on SWE-bench, a better cache hit rate, and lower latency in multi-turn scenarios, as well as a 40–80% cost reduction from improved caching compared to Chat Completions.

Beyond CoT, the Responses API provides exclusive access to tool_search (deferred tool loading), native compaction, computer use, and stateful context via store: true.

For simple stateless requests without tools, Chat Completions remains a valid choice — it is stable, well-documented, and not slated for sunset anytime soon. The Assistants API, however, will be shut down on August 26, 2026 — so migrating from it to the Responses API is a priority.

When does CoT transfer between turns not provide an advantage?

CoT persistence is only useful when the reasoning of one step actually affects the quality of the next. In several common scenarios, this advantage is not realized.

First, single-turn and stateless requests: if each request is contextually independent — batch classification, mass text generation from a template, parallel independent tasks — there is nothing to carry CoT across, and the overhead of the Responses API is not justified. Second, short simple chains with effort=none/low: reasoning tokens are either not generated or minimal, so CoT persistence yields no measurable effect. Third, scenarios with frequent context changes: if the task context changes drastically between agent steps (e.g., separate independent subtasks), preserved CoT may even degrade quality, as the model will "drag along" irrelevant reasoning from the previous step.

How does tool_search affect inference cost in real agentic scenarios?

Tool search is a mechanism that lets the model defer a large tool surface to runtime: instead of passing the full list of available tools in each request, the model dynamically loads only those relevant to the current step. According to the OpenAI Changelog, this "reduces token usage, preserves cache performance, and improves latency." According to VentureBeat, overall savings on some agentic tasks reach 47% of tokens compared to predecessors.

In practice, this matters for systems with a large number of registered tools — for example, corporate agents with dozens of MCP servers or API integrations. Previously, passing the full tool manifest in each request significantly inflated input tokens and reduced the cache hit rate, since the manifest changed between requests. Tool search solves both problems simultaneously. It is available exclusively through the Responses API — another argument in favor of migration for agentic scenarios.
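The intuition behind deferred tool loading can be shown with a client-side analogue. The real mechanism is server-side and exposed via the Responses API; this sketch (with hypothetical tool names and a naive word-overlap score) just shows why sending two relevant tools instead of the whole manifest shrinks the prompt:

```python
# Client-side analogue of the tool-search idea: select only the tools
# whose descriptions overlap with the current step, instead of sending
# the full manifest every request. Tool names here are hypothetical.
TOOLS = [
    {"name": "git_commit", "description": "commit staged changes to git"},
    {"name": "run_tests", "description": "run the project test suite"},
    {"name": "query_crm", "description": "look up a customer in the CRM"},
    {"name": "send_email", "description": "send an email to a customer"},
]

def select_tools(step: str, tools: list, limit: int = 2) -> list:
    """Rank tools by naive word overlap with the step and keep the top few."""
    words = set(step.lower().split())
    scored = [
        (len(words & set(t["description"].lower().split())), t) for t in tools
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for score, t in scored[:limit] if score > 0]

step = "run the failing test suite and commit the fix to git"
print([t["name"] for t in select_tools(step, TOOLS)])
# → ['run_tests', 'git_commit']
```

With dozens of MCP servers the manifest can dwarf the user message; pruning it per step is where both the input-token savings and the cache-hit improvement come from, since the remaining prompt prefix stays stable between requests.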

✅ Conclusions

GPT-5.4 is not just a "better version" of its predecessor. It's a shift in architectural philosophy: from a set of specialized models to a unified parameterized inference pipeline. For developers, this means less cognitive load when choosing a model, but greater responsibility for correctly configuring parameters — specifically reasoning.effort and the choice between Chat Completions and Responses API.

Key architectural conclusions:

  • Consolidation of gpt-5.2 and gpt-5.3-codex into one model reduces API fragmentation but increases the base cost of inference — optimization is needed at the parameter level, not endpoint selection
  • CoT between turns via Responses API — the biggest hidden advantage for agentic systems, underestimated in most reviews
  • "Thinking," "Pro," and base gpt-5.4 — these are one model graph with different inference configurations, not separate architectures
  • The direction of GPT-5.x indicates the growing role of the Responses API as the primary protocol and the gradual deprecation of Chat Completions for complex scenarios

In the next articles of the series, we will delve into inference under the hood — how the 1M token context window is structured, where real bottlenecks arise, and how the cost comparison of GPT-5.4 with Claude Sonnet 4.6 and Gemini looks in measurable scenarios.

Sources:

OpenAI — Introducing GPT-5.4

OpenAI API — Using GPT-5.4

Augment Code — GPT-5.4 as default model

Microsoft Azure Foundry — GPT-5 vs GPT-4.1 model choice guide

OpenRouter — Reasoning Tokens documentation

The Neuron — GPT-5.4 full breakdown
