OpenAI Released GPT-5.4: Key Changes, Features & Benchmarks in 2026


On March 5, 2026, OpenAI released GPT-5.4 simultaneously in ChatGPT, the API, and Codex.

This is not just another incremental update: for the first time, the model combines the GPT-5.3-Codex coding pipeline with general reasoning, gains native computer use, and supports a context window of up to 1M tokens.

In short: if you are building agentic workflows or coding tools, this is a release worth paying attention to today.

⚡ Key Highlights in 30 Seconds

  • Release date: March 5, 2026, rolling out in ChatGPT, the API, and Codex simultaneously
  • Consolidated model: GPT-5.3-Codex and GPT-5.2 are merged into a single model, so there is no need to switch between endpoints
  • Native computer use: OpenAI's first mainline model that controls a computer autonomously via Playwright and mouse/keyboard commands
  • 1M tokens of context in the API (with double pricing beyond 272K)
  • 47% fewer tokens on some agentic tasks compared to predecessors
  • 33% fewer errors in specific assertions compared to GPT-5.2


🗓️ What was released and when

OpenAI officially announced GPT-5.4 on March 5, 2026. The model is immediately available across three surfaces:

  • ChatGPT — as GPT-5.4 Thinking for Plus, Team, and Pro users (replaces GPT-5.2 Thinking). GPT-5.2 Thinking remains in Legacy Models until June 5, 2026
  • API — endpoints gpt-5.4 and gpt-5.4-pro are available now
  • Codex — becomes the default model, replacing GPT-5.3-Codex

GPT-5.4 Pro is available via the API and on the ChatGPT Pro ($200/month) and Enterprise plans. Free users gain access to GPT-5.4 through query auto-rotation, according to VentureBeat.

⚙️ Three Main Changes

1. No more choosing between GPT-5.x and Codex

Before the GPT-5.4 release, the standard architecture for an agentic pipeline with mixed tasks looked like this: GPT-5.2 for planning and reasoning steps, GPT-5.3-Codex for code generation and execution. Each switch between models meant a separate API call, separate context management, different behavior in edge cases, and different fine-tuning parameters. Over long agent trajectories, this accumulated into significant overhead in latency and code complexity.

GPT-5.4 eliminates this need. According to OpenAI, this is the first mainline reasoning model to incorporate the frontier coding capabilities of GPT-5.3-Codex into unified weights, the result of merging training stacks rather than routing logic.

In practice, this means:

  • SWE-Bench Pro: 57.7% vs 56.8% for GPT-5.3-Codex. GPT-5.4 reproduces the coding performance of the Codex model with lower latency and additional reasoning capabilities, according to gaga.art
  • GDPval: 83.0% on a new OpenAI metric: 44 professions from 9 industries, 1,320 tasks from domain specialists with 14+ years of experience. GPT-5.4 surpasses GPT-5.2 (70.9%) and matches or outperforms a human domain specialist in 83% of comparisons, according to The Decoder
  • Practically, for developers: if your pipeline used two endpoints, it is now enough to change the model ID to gpt-5.4; in most cases this is a swap with no logic changes. GPT-5.4 becomes the default model in Codex, replacing GPT-5.3-Codex automatically
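The endpoint swap described above can be sketched as a single request payload. This is a minimal illustration, not the official SDK surface: the model ID comes from this release, while the prompt, roles, and helper function are placeholders.

```python
# Sketch: a former two-model pipeline (GPT-5.2 for planning,
# GPT-5.3-Codex for coding) collapsed into one request.
# The helper and its messages are illustrative assumptions.

def build_agent_request(task: str, model: str = "gpt-5.4") -> dict:
    """Build a single request where planning and coding used to be
    split between two separate endpoints."""
    return {
        "model": model,
        "input": [
            {"role": "developer",
             "content": "Plan first, then write and run the code."},
            {"role": "user", "content": task},
        ],
    }

payload = build_agent_request("Refactor the retry logic in client.py")
print(payload["model"])  # gpt-5.4
```

With an API client, a payload like this would be sent as one call; the point is that one model ID now covers both the planning step and the coding step.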

Separately, it is worth noting a new feature in ChatGPT Thinking: the model now shows a reasoning plan before execution and lets you correct its direction mid-response, so there is no need to restart the query from scratch if the model heads the wrong way. Available on chatgpt.com, with Android and iOS coming soon, according to DataCamp.

2. Native computer use: mechanics and real figures

GPT-5.4 is OpenAI's first general model with built-in computer use. It is important to understand the architecture: this is not a single mechanism but two parallel approaches that the model combines depending on the task:

  • Code-based automation: the model writes code using Playwright or similar libraries to control browsers and desktop applications. Suitable for deterministic, repeatable workflows: forms, navigation, data extraction
  • Screenshot-based control: the model receives a screenshot of the current screen state and issues mouse/keyboard commands. Suitable for tasks where the UI structure is unpredictable or changes between sessions

Behavior is steered via developer messages and custom confirmation policies: developers can configure which actions require user confirmation and which are executed autonomously, an important mechanism for production deployments with varying levels of risk, according to OpenAI.
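A confirmation policy of the kind described above can be sketched as plain dispatch logic. The action names and risk tiers here are invented for illustration; they are not part of any OpenAI API.

```python
# Sketch of a custom confirmation policy: route each proposed agent
# action to "autonomous" or "confirm" based on a risk tier.
# The action categories below are hypothetical examples.

HIGH_RISK = {"submit_payment", "delete_file", "send_email"}
LOW_RISK = {"read_page", "take_screenshot", "fill_form_field"}

def confirmation_policy(action: str) -> str:
    """Return how an action should be handled before execution."""
    if action in HIGH_RISK:
        return "confirm"      # pause and ask the user first
    if action in LOW_RISK:
        return "autonomous"   # execute without interruption
    return "confirm"          # default to the safe path for unknown actions

print(confirmation_policy("take_screenshot"))  # autonomous
print(confirmation_policy("submit_payment"))   # confirm
```

Defaulting unknown actions to "confirm" is the design choice that matters for production: new tool capabilities stay gated until someone explicitly classifies them as safe.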

Key benchmarks:

  • OSWorld-Verified: 75.0%, above the average human score (72.4%). For comparison, GPT-5.2 on the same benchmark showed only 47.3%, an increase of more than 1.5×, according to VentureBeat
  • BrowseComp: 82.7% (base) / 89.3% (Pro). Measures the agent's ability to find hard-to-reach information on the internet through persistent browsing. GPT-5.2 showed 65.8%, an increase of roughly 17 percentage points

To demonstrate these capabilities, OpenAI released an experimental Codex skill, Playwright (Interactive): the model can visually debug web and Electron applications in real time, and even test an application while building it. According to DataCamp, this combination of code generation and a visual feedback loop points to a direction where AI agents will be able to iterate on frontends with minimal human involvement.

3. Tool Search: from static manifest to on-demand discovery

This is perhaps the most practically important change for developers building systems with a large number of tools. Previously, passing tool definitions into the system prompt was inefficient: all schemas were loaded into context on every call, regardless of whether they were needed at a given step.

GPT-5.4 solves this with a new architecture: the model receives only a lightweight list of available tools and loads full definitions on demand, only when it decides to use a specific tool. According to The Decoder, large tool ecosystems previously added tens of thousands of unnecessary tokens to each request.

Practical effect of Tool Search:

  • 47% fewer tokens on agentic tasks with a large number of tools, according to VentureBeat
  • Scalability: tool search allows working with ecosystems containing tens of thousands of tools, such as corporate MCP servers or large API catalogs, according to Apidog
  • Cache hit rate: since the lightweight tool list is more stable between requests than the full manifest, caching works more efficiently, further reducing inference cost
  • Limitations: available exclusively via the Responses API, not via Chat Completions
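The on-demand pattern behind Tool Search can be sketched in a few lines. The registry, the two example tools, and their schemas below are all illustrative assumptions, not a real API surface:

```python
import json

# Sketch: a tool registry that exposes only names and one-line
# descriptions up front, and serves full JSON schemas on demand.
# Both tools are made up for illustration.

REGISTRY = {
    "get_invoice": {
        "description": "Fetch an invoice by ID",
        "schema": {"type": "object",
                   "properties": {"invoice_id": {"type": "string"},
                                  "include_lines": {"type": "boolean"}}},
    },
    "search_crm": {
        "description": "Search CRM contacts",
        "schema": {"type": "object",
                   "properties": {"query": {"type": "string"},
                                  "limit": {"type": "integer"}}},
    },
}

def lightweight_listing() -> str:
    """What goes into every request: names and descriptions only."""
    return json.dumps({n: t["description"] for n, t in REGISTRY.items()})

def load_definition(name: str) -> dict:
    """Loaded only when the model decides to call this tool."""
    return REGISTRY[name]["schema"]

full_manifest = json.dumps(REGISTRY)
print(len(lightweight_listing()) < len(full_manifest))  # True
```

With two toy tools the saving is small, but the listing grows linearly in tool count while the full manifest grows with schema size, which is where the reported token reductions on large ecosystems would come from.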

Separately, accuracy has improved: on a set of de-identified prompts where users previously reported factual errors, GPT-5.4 produces 33% fewer erroneous statements and 18% fewer responses containing any error compared to GPT-5.2, according to OpenAI.

For production systems where accuracy is critical (legal analysis, financial calculations), this is a measurable improvement in reliability.


📊 Quick comparison with competitors

Current as of March 2026. Sources: Digital Applied, OpenAI, gaga.art.

| Parameter | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
| --- | --- | --- | --- |
| Context window | 1M API / 272K standard (beyond 272K: 2× pricing) | 200K (1M beta) | 2M |
| SWE-bench Verified | 80.0% | 80.8% | ~74% |
| OSWorld (computer use) | 75.0% (human: 72.4%) | 72.7% | N/A |
| BrowseComp (web agents) | 82.7% / Pro: 89.3% | N/A | N/A |
| Input / Output, $ per 1M tokens | $2.50 / $15 (base); $30 / $180 (Pro) | $15 / $75 | $2 / $12 |
| Native computer use | ✅ built-in | Limited | |
| CoT between turns | ✅ (Responses API) | | |
| Tool Search | ✅ (47% fewer tokens) | | |

💡 Full comparison with 11 parameters, inference cost analysis, and practical hierarchy model → GPT-5.4: Architectural breakdown for developers


✅ What to do right now

If you have an agentic workflow or coding pipeline

  • Swap the model ID to gpt-5.4 and run your evals. If you previously used GPT-5.3-Codex, GPT-5.4 reproduces its SWE-Bench Pro result (57.7% vs 56.8%) with lower latency. If you used GPT-5.2, expect improvements on coding tasks without reasoning degradation
  • Consider migrating to the Responses API if you use Chat Completions with a large number of tools. The Responses API unlocks Tool Search (47% fewer tokens), CoT between turns, and native compaction, three features unavailable via Chat Completions
  • Enable /fast mode in Codex for tasks where speed is critical: the same GPT-5.4, but up to 1.5× faster token velocity, according to VentureBeat
  • For the 1M context window in Codex, configure model_context_window and model_auto_compact_token_limit in the Codex settings. Important: requests beyond the standard 272K are priced at 2× the normal rate, according to gaga.art
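The 2× surcharge beyond 272K can be made concrete with a small calculator. The $2.50-per-1M base input rate comes from the comparison table in this article; the exact tiering mechanics are an assumption about how such pricing is typically applied.

```python
# Sketch: estimate input cost for a long-context request where tokens
# beyond the 272K standard window are billed at double the base rate.
# Base rate from this article's table; tier mechanics are an assumption.

BASE_RATE = 2.50 / 1_000_000   # dollars per input token
STANDARD_WINDOW = 272_000      # tokens billed at the base rate

def input_cost(tokens: int) -> float:
    """Return the estimated input cost in dollars for one request."""
    standard = min(tokens, STANDARD_WINDOW)
    overflow = max(tokens - STANDARD_WINDOW, 0)
    return standard * BASE_RATE + overflow * BASE_RATE * 2

print(round(input_cost(272_000), 2))    # 0.68: fully inside the window
print(round(input_cost(1_000_000), 2))  # 4.32: 728K overflow tokens at 2x
```

Under these assumptions, filling the full 1M window costs over 6× as much as a 272K request, which is worth modeling before enabling long context by default.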

If you are building computer use agents

  • Use the updated computer tool in the API. OpenAI's documentation recommends the original and high image-detail settings, which significantly improve element localization and click accuracy
  • Configure custom confirmation policies for actions with different risk levels: define which operations run autonomously and which require user confirmation before execution
  • Try Playwright (Interactive) in Codex for visually debugging web and Electron applications. It is an experimental skill, but already functional for real frontend tasks

If you have simple high-throughput tasks

  • Do not rush to migrate: gpt-5-mini or gpt-5.3-chat-latest remain the better cost/latency choice for classification, summarization, and template-filling. GPT-5.4 would be overkill, and more expensive, for these scenarios
  • GPT-5.2 in the API has no announced deprecation date, so legacy systems can be left untouched for now

Key Dates

  • June 5, 2026: GPT-5.2 Thinking is disabled in ChatGPT (it moves to Legacy Models now, with full disablement in 3 months). If you use it in a product via the ChatGPT interface, migrate before this date
  • August 26, 2026: Assistants API sunset. If you are still using the Assistants API, migration to the Responses API is a priority task right now

🔬 Want to understand how it works?

This article is a brief overview of what was released. If you are interested in the engineering mechanics: how the reasoning pipeline changed from GPT-5.0 to 5.4, why a consolidated model is an architectural compromise, and how reasoning.effort affects cost and latency, read the detailed breakdown:

👉 GPT-5.4 in 2026: from specialized models to consolidated architecture — what changed and why


14 min read · 5 sections · benchmarks · tables · FAQ

Sources:

OpenAI — Introducing GPT-5.4

TechCrunch — OpenAI launches GPT-5.4

VentureBeat — GPT-5.4 native computer use

Digital Applied — GPT-5.4 vs Claude vs Gemini

OpenAI Academy — GPT-5.4 Thinking and Pro
