On March 5, 2026, OpenAI released GPT-5.4 — simultaneously in ChatGPT, API, and Codex.
This is not just another incremental update: for the first time, the model combines the GPT-5.3-Codex coding pipeline
with general reasoning, gains native computer use, and offers a context window of up to 1M tokens.
In short: if you are building agentic workflows or coding tools —
this is a release worth paying attention to today.
⚡ Key Highlights in 30 Seconds
- ✅ Release Date: March 5, 2026, rollout in ChatGPT, API, and Codex simultaneously
- ✅ Consolidated model: GPT-5.3-Codex and GPT-5.2 are merged into a single model — no longer need to switch between endpoints
- ✅ Native computer use: OpenAI's first mainline model that controls a computer autonomously via Playwright and mouse/keyboard commands
- ✅ 1M tokens of context in API (with double pricing beyond 272K)
- ✅ −47% tokens on some agentic tasks compared to predecessors
- ✅ −33% erroneous statements compared to GPT-5.2
📚 Table of Contents
- 📌 What was released and when
- 📌 3 main changes for developers
- 📌 Quick comparison with competitors
- 📌 What to do right now
- 📌 Want to go deeper?
🗓️ What was released and when
OpenAI officially announced GPT-5.4
on March 5, 2026. The model is immediately available across three surfaces:
- ChatGPT — as GPT-5.4 Thinking for Plus, Team, and Pro users (replaces GPT-5.2 Thinking). GPT-5.2 Thinking remains in Legacy Models until June 5, 2026
- API — the `gpt-5.4` and `gpt-5.4-pro` endpoints are available now
- Codex — becomes the default model, replacing GPT-5.3-Codex
GPT-5.4 Pro is available via API and for ChatGPT Pro ($200/month) and Enterprise plans.
Free users gain access to GPT-5.4 through automatic query routing, according to OpenAI.
⚙️ 3 main changes
1. No longer need to choose between GPT-5.x and Codex
Before the GPT-5.4 release, the standard architecture for an agentic pipeline with mixed tasks
looked like this: GPT-5.2 for planning and reasoning steps, GPT-5.3-Codex for generation
and code execution. Each switch between models meant a separate API call, separate context management,
different behavior in edge cases, and different fine-tuning parameters.
For long agent trajectories, this accumulated into significant overhead in terms of latency and
code complexity.
GPT-5.4 eliminates this need. According to OpenAI, this is the first mainline reasoning model that incorporates
the frontier coding capabilities of GPT-5.3-Codex into unified weights — a result of merging training stacks, not routing logic.
In practice, this means:
- SWE-Bench Pro: 57.7% vs. 56.8% for GPT-5.3-Codex — GPT-5.4 reproduces the coding performance of the Codex model with lower latency and additional reasoning capabilities, according to gaga.art
- GDPval: 83.0% — a new OpenAI benchmark covering 44 professions across 9 industries, with 1,320 tasks written by domain specialists with 14+ years of experience. GPT-5.4 surpasses GPT-5.2 (70.9%) and matches or outperforms a human domain specialist in 83% of comparisons, according to OpenAI
In practical terms for developers: if your pipeline used two endpoints, it is now enough to change the model ID to
`gpt-5.4` — in most cases this is a drop-in swap with no logic changes. GPT-5.4 becomes the default model in Codex,
replacing GPT-5.3-Codex automatically.
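The collapse of the two-endpoint setup amounts to a trivial routing change. The helper below is a hypothetical sketch, not part of any SDK; only the model IDs come from the release notes:

```python
# Before GPT-5.4: mixed agentic pipelines routed each step to one of
# two endpoints (hypothetical routing helper; model IDs per the release).
def pick_model_legacy(step: str) -> str:
    return "gpt-5.3-codex" if step == "coding" else "gpt-5.2"

# After GPT-5.4: one consolidated model handles both step types,
# so the router degenerates to a constant.
def pick_model(step: str) -> str:
    return "gpt-5.4"

print(pick_model_legacy("planning"), "->", pick_model("planning"))
```

Any surrounding context management and retry logic stays as it was; only the returned model ID changes.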
Separately, a new feature in ChatGPT Thinking is worth noting: the model now shows a reasoning plan
before execution and lets you correct its direction mid-response —
no need to restart the query from scratch if the model heads the wrong way. Available
on chatgpt.com and Android; iOS support is coming soon, according to OpenAI.
2. Native computer use: mechanics and real figures
GPT-5.4 is OpenAI's first general model with built-in computer use. It's important to understand
the architecture: it's not a single mechanism, but two parallel approaches that the model combines
depending on the task:
- Code-based automation — the model writes code using Playwright or similar libraries to control browsers and desktop applications. Suitable for deterministic, repeatable workflows: forms, navigation, data extraction
- Screenshot-based control — the model receives a screenshot of the current screen state and issues mouse/keyboard commands. Suitable for tasks where the UI structure is unpredictable or changes between sessions
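The split between the two modes can be summed up as a dispatch decision. The function below is purely illustrative (its name and return values are not an OpenAI API):

```python
def choose_automation_mode(ui_is_deterministic: bool) -> str:
    """Pick a computer-use strategy for a task (illustrative sketch).

    Stable, repeatable UIs suit generated Playwright-style scripts;
    unpredictable UIs fall back to screenshot + mouse/keyboard control.
    """
    return "code_based" if ui_is_deterministic else "screenshot_based"

# A fixed-layout form-filling job vs. a UI that changes between sessions:
print(choose_automation_mode(True), choose_automation_mode(False))
```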
Behavior is steered via developer messages and custom confirmation policies:
developers can configure which actions require user confirmation and which
run autonomously — an important mechanism for production deployments with varying
risk levels, according to OpenAI.
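A confirmation policy of the kind described might look like the following sketch. The action names and the policy structure are assumptions for illustration, not OpenAI's actual schema:

```python
# Hypothetical allow/confirm policy for computer-use actions.
AUTONOMOUS = {"scroll", "click", "type", "read_page"}
NEEDS_CONFIRMATION = {"submit_payment", "delete_file", "send_email"}

def policy(action: str) -> str:
    """Return 'run' for safe actions, 'ask_user' for risky or unknown ones."""
    if action in AUTONOMOUS:
        return "run"
    # Risky actions, and anything unrecognized, default to the safe path.
    return "ask_user"

print(policy("click"), policy("submit_payment"), policy("format_disk"))
```

The important design point from the release is that this boundary is developer-configurable per deployment, rather than fixed by the model.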
Key benchmarks:
- OSWorld-Verified: 75.0% — above the average human score (72.4%). For comparison, GPT-5.2 scored only 47.3% on the same benchmark, a gain of more than 1.5×, according to OpenAI
- BrowseComp: 82.7% (base) / 89.3% (Pro) — measures the agent's ability to find hard-to-reach information on the internet through persistent browsing. GPT-5.2 scored 65.8%, an improvement of 16.9 percentage points
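The deltas quoted above follow directly from the raw scores:

```python
# OSWorld-Verified: GPT-5.4 at 75.0% vs. GPT-5.2 at 47.3% (relative gain)
osworld_gain = 75.0 / 47.3
# BrowseComp (base): 82.7% vs. 65.8% (absolute percentage-point delta)
browsecomp_delta = 82.7 - 65.8
print(round(osworld_gain, 2), round(browsecomp_delta, 1))  # 1.59 16.9
```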
To demonstrate these capabilities, OpenAI released an experimental Codex skill,
Playwright (Interactive): the model can visually debug web and Electron
applications in real time — and even test an application while it is being built.
According to OpenAI, this combination of code generation and a visual feedback loop points to a direction where AI agents
will be able to iterate on frontend work with minimal human involvement.
3. Tool Search: from static manifest to on-demand discovery
This is perhaps the most practically important change for developers building systems
with a large number of tools. Previously, passing tool definitions into the system prompt
was inefficient: all schemas were loaded into context with each call,
regardless of whether they were needed at a specific step.
GPT-5.4 solves this through a new architecture: the model receives only a lightweight
list of available tools and loads full definitions on demand,
only when it decides to use a specific tool. According to OpenAI,
large tool ecosystems previously added tens of thousands of unnecessary tokens
to each request.
Practical effect of Tool Search:
- −47% tokens on agentic tasks with a large number of tools, according to OpenAI
- Scalability: tool search enables ecosystems containing tens of thousands of tools — for example, corporate MCP servers or large API catalogs
- Cache hit rate: since the lightweight tool list is more stable between requests than the full manifest, caching works more efficiently, further reducing inference cost
- Limitations: available exclusively via the Responses API, not via Chat Completions
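The token-saving idea is simple to illustrate. The snippet below is a schematic of on-demand definition loading, not the actual Responses API surface; the tool names and schemas are made up:

```python
import json

# Full JSON schemas: expensive to ship with every request.
FULL_SCHEMAS = {
    "get_weather": {
        "type": "function",
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
        },
    },
    "search_docs": {
        "type": "function",
        "name": "search_docs",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
        },
    },
}

def lightweight_list(schemas: dict) -> list:
    """Names only: all that needs to enter every request's context."""
    return sorted(schemas)

def load_definition(name: str, schemas: dict) -> dict:
    """Full schema, fetched only once the model picks a tool."""
    return schemas[name]

full_size = len(json.dumps(FULL_SCHEMAS))
lite_size = len(json.dumps(lightweight_list(FULL_SCHEMAS)))
print(lite_size < full_size)  # the name list is far smaller than the manifest
```

The cache-hit observation in the list above follows from the same structure: the name list rarely changes between requests, while full schemas may vary per step.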
Separately, the accuracy improvement is worth noting: on a set of de-identified prompts
where users had previously reported factual errors, GPT-5.4 shows
−33% erroneous statements and −18% responses containing any
errors compared to GPT-5.2, according to OpenAI.
For production systems where accuracy is critical (legal analysis, financial calculations),
this is a measurable improvement in reliability.