On May 7, 2026, OpenAI made an announcement that many in the developer community had long awaited: three new voice models in the Realtime API. The flagship, GPT-Realtime-2, is the first in the lineup to embed GPT-5 level reasoning directly into the voice stream. No delays between recognition and response. No separate pipelines.
The upshot: voice agents no longer have to choose between "smart" and "fast."
In short: OpenAI released GPT-Realtime-2 (GPT-5 level reasoning), GPT-Realtime-Translate (real-time translation across 70+ languages), and GPT-Realtime-Whisper (streaming transcription). All three are available now in the Realtime API. OpenRouter won't work for this – and here's why.
Context: Why voice agents have always been "smart or fast" – but not both
Before this release, voice agent developers faced the same choice. Either you take a model that speaks naturally and responds quickly – but can't handle complex queries. Or you take a model with real reasoning – and get 5–7 seconds of silence between question and answer, which in a voice interface is equivalent to killing the conversation.
This problem is not new. For two years, the industry has tried to solve it by optimizing individual components – faster ASR, smaller LLM, more aggressive TTS caching. But the fundamental limitation remained: the architecture was cascaded.
The classic voice agent stack looked like this:
- ASR (Automatic Speech Recognition) – converts speech to text. Best solutions: Whisper, Deepgram, AssemblyAI. Latency: 200–500 ms.
- LLM – receives text, processes it, generates a response. If reasoning (CoT) is used – add another +2–6 seconds.
- TTS (Text-to-Speech) – converts the response back into speech. ElevenLabs, Cartesia, OpenAI TTS. Another 200–400 ms.
Total latency from the end of the question to the start of the answer: 1.5–8 seconds, depending on the complexity of the query and the chosen components. In a text chat that's tolerable. In a voice interface it's a disaster: a person perceives a pause longer than about 1.5 seconds as a glitch or a freeze.
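To make the arithmetic concrete, here is a back-of-the-envelope sketch of the cascaded latency budget. The per-stage ranges are illustrative (the plain-generation figure in particular is an assumption, not a measurement), but they show why the total lands in that 1.5–8 second band.

```python
# Illustrative latency budget for the cascaded ASR -> LLM -> TTS stack.
# All values in milliseconds; the LLM plain-generation range is an assumption.
ASR = (200, 500)          # speech -> text
LLM = (1000, 1500)        # plain generation, no chain-of-thought (assumed)
REASONING = (2000, 6000)  # extra cost when reasoning is enabled
TTS = (200, 400)          # text -> speech

def total(*stages):
    return sum(s[0] for s in stages), sum(s[1] for s in stages)

print("fast path (no reasoning):", total(ASR, LLM, TTS))   # ~1.4-2.4 s
print("with reasoning:", total(ASR, LLM, REASONING, TTS))  # ~3.4-8.4 s
```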
This is what led to the compromise. Teams building voice agents for call centers or support had to choose:
- Option A – Fast but Limited: small model (GPT-4o mini, Llama 3 8B), no reasoning, responds in 800–900 ms. Handles FAQs and simple scenarios, breaks on non-standard requests or multi-step tasks.
- Option B – Smart but Slow: large model with reasoning, responds in 4–7 seconds. Solves complex queries, but the conversation turns into a series of awkward pauses.
In practice, most production systems chose Option A and tried to "cover up" the model's limitations through rigid scripts, fallback phrases, and detailed system prompts. Reasoning in voice remained unattainable without sacrificing UX.
Another problem with the cascaded stack is that each component has its own failure points. ASR misrecognized a word – LLM received incorrect context – TTS voiced nonsense. Debugging such a system is difficult: an error can occur at any of the three steps, and it's often unclear where exactly.
GPT-Realtime-2 eliminates the cascaded architecture itself. The model takes audio as input and outputs audio – reasoning happens within a single loop, without conversions between formats. There are no three components – no three failure points and three cumulative delays. This is not "a better model in the same stack" – it's a replacement of the approach itself.
This is precisely why this release is important not as another increment, but as an architectural shift in how voice products are built at all.
GPT-Realtime-2, Translate, Whisper – three models for three different tasks
OpenAI released not one model, but three – and each closes a specific scenario. These are not "basic, standard, and premium" versions of the same product. These are three fundamentally different tools with different architectures, different billing, and different use cases. It's important not to confuse them with each other even at the selection stage.
GPT-Realtime-2 – a voice agent with reasoning
The flagship of the release. This is OpenAI's first voice model with GPT-5 level reasoning – a speech-to-speech model that listens to audio, thinks, and responds with audio, without converting to text between steps.
Key features:
- Context window: 128K tokens (was 32K in GPT-Realtime-1.5)
- Reasoning effort: minimal / low / medium / high / xhigh – adjustable for the task
- Billing: per token ($32/1M input, $64/1M output)
- Support: parallel tool calls, preambles, error recovery
When to use GPT-Realtime-2: support voice agents with complex scenarios, assistants performing multi-step tasks (booking, searching, changing data), any product where it's important not just to respond, but to understand context and act.
When not to use: if you only need transcription or translation, GPT-Realtime-2 is overkill at a higher price – the specialized models below cover those cases.
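For orientation, here is a minimal sketch of opening a GPT-Realtime-2 session over WebSocket. It assumes the new model keeps the event protocol of the current Realtime API (session.update and friends); the model name comes from this release, and exact field names may differ in the official docs.

```python
# Minimal GPT-Realtime-2 session sketch. Assumes the current Realtime API
# event protocol; check the official docs for final field names.
import asyncio, json, os
import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"

async def main():
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    # Note: websockets >= 14 renamed extra_headers to additional_headers.
    async with websockets.connect(URL, extra_headers=headers) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": "You are a concise, friendly support agent.",
                "voice": "marin",
                "modalities": ["audio", "text"],
            },
        }))
        print(json.loads(await ws.recv())["type"])  # typically "session.created"

asyncio.run(main())
```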
GPT-Realtime-Translate – live translation between languages
A separate specialized model for real-time speech translation. Supports over 70 input languages and 13 output languages. Output languages include: English, Spanish, French, German, Japanese, Hindi, Portuguese, Arabic, and other major languages.
Key features:
- Billing: per minute ($0.034/min) – simple and predictable
- Simultaneously generates live transcripts during translation
- Keeps up with the pace of live speech, doesn't wait for the end of a sentence
- Preserves meaning with regional accents and industry terminology
When to use GPT-Realtime-Translate: international customer support (everyone speaks their own language), online education with a global audience, conferences and live streams with live translation, cross-border sales where a language barrier means a lost deal.
Important detail: this is not GPT-Realtime-2 with translation enabled. This is a separate model optimized specifically for translation – it doesn't conduct conversations or perform tasks, it translates the speech stream.
GPT-Realtime-Whisper – streaming transcription
A model that converts speech to text directly as a person speaks – not after, but during. This is not a conversational model: it doesn't respond, translate, or analyze. It transcribes.
Key features:
- Billing: per minute ($0.017/min) – the cheapest of the three
- Adjustable latency: lower settings = faster partial transcripts, higher = better quality
- Streaming: text appears word by word, not after a pause
When to use GPT-Realtime-Whisper: live subtitles for meetings and webinars, automatic notes that sync with the conversation, CRM systems that need to record calls in real-time, medical systems where a doctor dictates – and the text appears immediately in the patient's chart.
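As a rough sketch of what consuming that stream looks like: the event names below follow the transcription events of the current Realtime API and may well change for GPT-Realtime-Whisper, so treat them as assumptions.

```python
# Consuming streaming transcription events over an already-open Realtime
# WebSocket `ws`. Event names are assumptions for GPT-Realtime-Whisper.
import json

async def print_live_transcript(ws):
    async for raw in ws:
        event = json.loads(raw)
        if event["type"] == "conversation.item.input_audio_transcription.delta":
            # Partial text, emitted while the speaker is still talking.
            print(event["delta"], end="", flush=True)
        elif event["type"] == "conversation.item.input_audio_transcription.completed":
            # Final text for the finished utterance.
            print("\n--", event["transcript"])
```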
Main selection table:

| If you need... | Use |
| --- | --- |
| A voice agent that understands and responds | GPT-Realtime-2 |
| Translation between live conversation participants | GPT-Realtime-Translate |
| Text from what a person is saying | GPT-Realtime-Whisper |
And separately: GPT-Realtime-Whisper ≠ classic Whisper. Classic Whisper is batch transcription of a finished audio file after recording. GPT-Realtime-Whisper is streaming, word by word, while the person is still speaking. Different tools for different scenarios – not interchangeable.
What specifically has changed: 128K context, preambles, parallel tool calls
Compared to GPT-Realtime-1.5, the new model has received five specific improvements. Let's break down each one – not as a marketing feature list, but from the perspective of what it means for a production system.
Context window: 32K → 128K tokens
This is not a cosmetic change – it's the elimination of one of the main limitations of the previous version.
32K tokens in audio context was enough for about 20–30 minutes of conversation or for several tool calls with a moderate amount of data. For a simple FAQ agent, it was quite sufficient. But for real production scenarios, it was not enough:
- A call with the full customer history (previous orders, status, complaints) – the context overflows
- An agent flow with 5–10 tool calls, each returning data – the same
- A long session where the customer returns to a topic from the beginning of the conversation – the model "forgets"
Teams solved this through external state stitching – a separate layer that stored the conversation state outside the model and manually inserted the necessary context into each request. This is additional infrastructure, additional points of failure, and additional code to maintain.
128K tokens removes the need for this layer for most scenarios. A full session, the entire customer history, several rounds of tool calls – everything fits into one context without manual state management.
Preambles – solving the "intelligent silence" problem
One of the most frustrating UX problems with voice agents: the model thinks – the user hears silence. In text chat, a spinner or "Typing..." solves this visually. In voice, there was no analogue – the pause sounded like a glitch or a freeze.
Preambles are short audio phrases the model can speak before its main response, while reasoning continues in the background. Examples:
- "Just a moment, I'm checking..."
- "Let me look at your order."
- "One moment, I'm clarifying the details."
Technically, it's not just a text template that plays – the model generates the preamble contextually, considering what it's about to do next. "Checking your calendar" – if there's a tool call to the calendar. "Looking for information" – if there's a search. It's not a random placeholder phrase.
For UX, this is significant: the conversation doesn't break, the user knows the agent is working, and the natural rhythm of the dialogue is maintained even during complex operations.
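The release notes quoted here don't name the exact API switch for preambles. A conservative sketch is to steer them through the session instructions (a real session.update field today); if a dedicated preamble setting ships, it would replace the prompt hint.

```python
# Steering preambles via session instructions. The instructions field exists
# in the current Realtime API; a dedicated preamble parameter is not named in
# this article, so this is a workaround-style sketch.
import json

async def enable_preambles(ws):
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "instructions": (
                "Before any tool call that may take more than a second, say a "
                "short, context-specific preamble such as 'One moment, I'm "
                "checking your order', then continue with the result."
            ),
        },
    }))
```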
Parallel tool calls with audio feedback
GPT-Realtime-1.5 executed tool calls sequentially. You need to check the order status and product availability – first the first request, then the second. Each adds a delay.
GPT-Realtime-2 can launch multiple tool calls simultaneously – and simultaneously announce what's happening:
- "I'm checking your order and stock availability at the same time."
- "I'm looking for available slots and checking your subscription."
For agent flows with multiple data sources, this can significantly reduce the overall response time – instead of sequentially waiting for results, requests go in parallel.
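To give the model something to parallelize, tools are declared in the session. The format below follows the current Realtime API, and the two functions (check_order_status, check_stock) are hypothetical examples matching the scenario above.

```python
# Declaring two tools the model can call in parallel. Tool format follows the
# current Realtime API; the functions themselves are hypothetical.
import json

TOOLS = [
    {
        "type": "function",
        "name": "check_order_status",
        "description": "Look up the current status of a customer's order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
    {
        "type": "function",
        "name": "check_stock",
        "description": "Check warehouse availability for a product.",
        "parameters": {
            "type": "object",
            "properties": {"sku": {"type": "string"}},
            "required": ["sku"],
        },
    },
]

async def register_tools(ws):
    # With parallel tool calls, one model response may contain several
    # function_call items; execute them concurrently and return each output.
    await ws.send(json.dumps({"type": "session.update", "session": {"tools": TOOLS}}))
```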
Improved error recovery
In the previous version, a failure during a tool call or a timeout often meant either silence or a session termination. For production systems, this meant the need for a separate error handling layer that would catch errors and somehow voice them.
GPT-Realtime-2 handles errors natively – the model itself announces that something went wrong and continues the conversation:
- "I'm having a problem checking the status right now – let's try another way."
- "I can't get this information right now, but I can help with..."
The conversation doesn't break – the agent gracefully exits the situation and offers an alternative.
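In code, graceful recovery mostly means not swallowing the failure: return the error as the tool's output and let the model voice the fallback. The event names follow the current Realtime API and are assumed to carry over to the new model.

```python
# Report a failed tool call back to the model instead of dropping the session.
# Event names follow the current Realtime API (assumed to carry over).
import json

async def report_tool_error(ws, call_id, error_message):
    await ws.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": call_id,
            "output": json.dumps({"error": error_message}),
        },
    }))
    # Ask the model to continue; it can now say something like
    # "I can't check the status right now, let's try another way."
    await ws.send(json.dumps({"type": "response.create"}))
```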
Tone adjustment for context
A new capability to adapt the speaking style depending on the scenario. This is not just a "formal / informal" switch – the model considers the context of the conversation:
- A calmer, slower tone for complaints and complex support situations
- Clear and confident for order confirmation or important details
- More upbeat for onboarding or welcome scenarios
For brands with a distinct voice, this is an important detail. An agent that responds equally indifferently to "my order is lost" and "thank you for your purchase" is a bad agent, regardless of the quality of the response.
An important nuance regarding benchmarks: the default reasoning effort level in GPT-Realtime-2 is low. The marketing figures of +15.2% on Big Bench Audio and +13.8% on Audio MultiChallenge were obtained at the high / xhigh level. Higher effort = longer latency + more output tokens = higher cost. At the low level, the model responds faster but doesn't show the marketing figures from the press release. Start with low, measure quality on your real scenarios, and only increase effort where it's objectively necessary.
Real numbers: +26 points at Zillow, Deutsche Telekom, BolnaAI – what they built
OpenAI publishes not only its own benchmarks but also the results of real companies that tested the model before its release. This is more useful than synthetic tests – because it shows not "how many points on Big Bench Audio," but what has changed in a real product with real users.
Zillow: +26 percentage points on the most complex test
Zillow is an American real estate platform with over 200 million monthly visitors. They are building a voice agent to work with buyers and renters: searching for properties, answering questions about neighborhoods, booking viewings.
The complexity of the task is not technical, but legal. In the US, the Fair Housing Act is in effect – a law that prohibits discrimination in the sale and rental of real estate. An agent cannot give recommendations based on the racial composition of a neighborhood, religion, nationality of residents, and a number of other factors. Even answering the question "what's the neighborhood like?" can become a legal problem if phrased incorrectly.
This is why Zillow uses an adversarial benchmark – tests that check not only the quality of useful answers but also resistance to "dangerous" queries. On this test:
- GPT-Realtime-1.5: 69% successful calls
- GPT-Realtime-2 after prompt optimization: 95% successful calls
- Difference: +26 percentage points
What lies behind the figure: the agent on GPT-Realtime-2 better recognizes when a query is approaching a legally dangerous zone and gracefully redirects the conversation – without interruption and without violating compliance. For Zillow, this is not just "better quality" – it's the difference between an agent that can be put into production and an agent that carries legal risk.
Example of a query handled by the agent: "Find houses within my budget of $400K, without busy streets, preferably a quiet neighborhood, book a viewing for Saturday" – several tasks in one sentence, requiring search, filtering, and booking through tool calls in parallel.
Deutsche Telekom: language barrier in support – without switching languages
Deutsche Telekom is one of Europe's largest telecommunications operators with customers in dozens of countries. Their task: support where the customer speaks their language, the operator speaks theirs, and neither has to switch.
They are testing GPT-Realtime-Translate for a scenario where a customer calls in Turkish, for example, and the operator responds in German – and the model translates both streams in real-time with live transcripts. Neither the customer nor the operator hears the translation delay as a separate pause – the translation keeps up with the pace of the conversation.
Why this is important now: the alternative is either hiring multilingual operators (expensive and limited), or transferring the customer to an "English line" (bad experience), or asynchronous support via text (slow). GPT-Realtime-Translate offers a fourth option – a live call in the customer's native language without additional personnel costs.
BolnaAI: -12.5% Word Error Rate for Indian languages
BolnaAI is building voice agents for the Indian market – one of the most complex in terms of linguistic diversity. India has 22 official languages and hundreds of dialects. Hindi, Tamil, and Telugu are three of the most common, each with unique phonetics that are poorly recognized by models trained primarily on English data.
In BolnaAI's tests, GPT-Realtime-Translate showed a 12.5% reduction in Word Error Rate for these three languages compared to the other models they evaluated. Word Error Rate is the percentage of words the model recognized or translated incorrectly. A 12.5% reduction means roughly one in eight previously misrecognized words is now correct – for an agent handling hundreds of calls a day, this is significant.
Practical context: recognition errors in Indian languages are often not random – they are systematic, related to speech rhythm, aspirated consonants, and code-switching (when a speaker mixes Hindi with English words in the middle of a sentence). Improvements specifically in these languages indicate that the model has become better with linguistic variability, not just loudness or accent.
Three patterns that OpenAI highlights separately – and which model covers each:
- Voice-to-action – the user describes a task by voice, the agent reasons and executes it through tool calls. Zillow: "find and book." → GPT-Realtime-2
- Systems-to-voice – the system itself initiates a voice message at the right moment. Example: a travel app says "your flight is delayed, but you can still make your connection – new gate X, fastest route Y." → GPT-Realtime-2
- Voice-to-voice – two people speak different languages and hear each other in translation. Deutsche Telekom: customer in Turkish, operator in German. → GPT-Realtime-Translate
Pricing, availability, and reasoning effort: low / high / xhigh – what it means in practice
All three models are available through the OpenAI Realtime API right now. An important detail: along with this release, the Realtime API has officially exited beta and become generally available. For teams that postponed implementation due to beta instability – this is a green light. GA means SLA, stable endpoints, and no breaking changes without notice.
You can test it without writing code in the OpenAI Playground – there's already an interface for GPT-Realtime-2 with a microphone directly in the browser.
Pricing and billing model
The three models have different billing models – this is important to consider when planning costs:
| Model | Billing | Cost |
| --- | --- | --- |
| GPT-Realtime-2 | Per token | $32 / 1M input tokens, $0.40 / 1M cached input, $64 / 1M output tokens |
| GPT-Realtime-Translate | Per minute | $0.034 / minute |
| GPT-Realtime-Whisper | Per minute | $0.017 / minute |
Several practical observations regarding cost:
GPT-Realtime-2 – unpredictable billing with variable load. Token-based billing means the cost of a call depends on its duration, complexity of responses, and the number of tool calls. A short FAQ call and a long agent session with several rounds of reasoning – completely different costs. Allocate a buffer when planning your budget and measure average token usage on real calls before scaling.
Caching input tokens ($0.40 instead of $32 – an 80x difference) – significant savings. If your system prompt is large and identical across sessions, it gets cached. With active use, this can cut the effective cost of input dramatically. It's worth designing the architecture so that the stable part of the prompt comes first and gets cached.
GPT-Realtime-Translate and Whisper – simple and predictable billing. $0.034/min and $0.017/min respectively. 1000 minutes of translation = $34. Easy to budget and forecast as you grow.
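A quick way to feel the difference between the two billing models is to run the numbers. The prices come from the table above; the tokens-per-minute figures and cache share are pure assumptions to show the mechanics, so measure your own traffic before budgeting.

```python
# Rough monthly cost comparison. Prices are from the table above; the
# token-per-minute rates and cache share are assumptions for illustration.
def realtime2_cost(minutes, in_tok_per_min=800, out_tok_per_min=400, cached_share=0.5):
    uncached = minutes * in_tok_per_min * (1 - cached_share) / 1e6 * 32.00
    cached   = minutes * in_tok_per_min * cached_share       / 1e6 * 0.40
    output   = minutes * out_tok_per_min                      / 1e6 * 64.00
    return uncached + cached + output

def per_minute_cost(minutes, rate):
    return minutes * rate

minutes = 10_000  # e.g. ~2,000 calls of 5 minutes each
print(f"GPT-Realtime-2 (assumed token rates): ${realtime2_cost(minutes):,.2f}")
print(f"GPT-Realtime-Translate:               ${per_minute_cost(minutes, 0.034):,.2f}")
print(f"GPT-Realtime-Whisper:                 ${per_minute_cost(minutes, 0.017):,.2f}")
```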
Reasoning effort: what each level means in practice
GPT-Realtime-2 supports five levels of reasoning effort: minimal, low, medium, high, xhigh. The default is low. The choice of level affects three parameters simultaneously: depth of thought, response latency, and the number of output tokens (and thus, cost).
Here's how it looks in practice:
minimal / low – the model responds quickly, without deep reasoning. Suitable for: answering FAQs, confirming orders, simple navigation scenarios where the answer is unambiguous. Latency is minimal, cost is lowest. This is the level at which most production systems will operate 80% of the time.
medium – a balance between speed and depth. Suitable for: scenarios with multiple steps, where the context of previous remarks needs to be considered, but complex planning is not required. A good starting level for testing quality before deciding if high is needed.
high / xhigh – full reasoning. The model plans its response, considers edge cases, handles ambiguous queries and complex agent flows better. It is at these levels that the marketing benchmarks were obtained (+15.2% Big Bench Audio, +13.8% Audio MultiChallenge). Latency is noticeably higher, there are more output tokens – and the cost increases accordingly. Justified for: complex agent scenarios, compliance-sensitive tasks (like Zillow's), situations where an agent's error costs more than the delay.
Practical strategy for choosing effort: do not set xhigh "just in case." Start with low, record real calls where the agent made a mistake or gave an incomplete answer, and only increase effort for those categories of requests where it objectively improves the result. The cost difference between low and xhigh for 10,000 calls per month can be orders of magnitude – and most often it turns out that 70–80% of scenarios are perfectly handled at low or medium.
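The "start with low, escalate only where measured quality demands it" strategy is easy to encode as a routing table. Note that `reasoning_effort` is a hypothetical field name here, since this article doesn't give the exact session parameter; check the official docs for the real one.

```python
# Per-scenario effort routing. `reasoning_effort` is a hypothetical field name;
# the routing idea, not the exact key, is the point.
import json

EFFORT_BY_CATEGORY = {
    "faq": "low",
    "order_status": "low",
    "booking": "medium",
    "complaint_escalation": "high",  # category measured as failing at low/medium
}

async def set_effort_for(ws, category):
    effort = EFFORT_BY_CATEGORY.get(category, "low")
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {"reasoning_effort": effort},  # hypothetical field name
    }))
```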
New voices: Cedar and Marin
Along with the models, OpenAI released two new voices – Cedar and Marin. They are available for GPT-Realtime-2 alongside existing ones (Alloy, Echo, Shimmer, etc.).
Cedar is a neutral, calm tone, well-suited for support and informational scenarios. Marin is slightly warmer and livelier, better for onboarding and conversion flows. The choice of voice does not affect the cost – it's a session parameter that can be switched without additional expense.
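Since voice is a session parameter in the current Realtime API, switching it is a one-line session.update; in the current API the voice has to be chosen before the model produces its first audio in a session, so pick it up front.

```python
# Selecting a voice. In the current Realtime API the voice must be set before
# the model's first audio output in the session.
import json

async def set_voice(ws, voice="cedar"):  # "cedar" or "marin"
    await ws.send(json.dumps({"type": "session.update", "session": {"voice": voice}}))
```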
Why OpenRouter is not suitable for Realtime API — and what to use instead
This question naturally arises for developers accustomed to the convenience of aggregators. OpenRouter provides one key — and access to hundreds of models from OpenAI, Anthropic, Google, Mistral, and dozens of other providers. It's logical to try to connect GPT-Realtime-2 there as well and not complicate the infrastructure.
But there is a fundamental architectural incompatibility here — and it cannot be resolved with settings or workarounds.
What is the difference in protocols
OpenRouter works through the standard Chat Completions API — these are classic HTTP requests in a "request → response" scheme. You send a POST request with messages, receive a response, and the connection closes. Even streaming in the Chat Completions API is technically implemented via HTTP — Server-Sent Events (SSE), not a true bidirectional channel.
GPT-Realtime-2 works fundamentally differently. It uses WebSocket — a protocol that establishes a persistent bidirectional connection between the client and the server. Audio flows in both directions simultaneously and continuously: the client sends a stream of audio chunks while the user speaks, and the model responds with audio chunks in real-time even before the user finishes the sentence. This is not "request → response" — it's a constantly open channel for the entire duration of the conversation.
OpenRouter is built on HTTP infrastructure. Proxying WebSocket connections through it is not a matter of settings, it's a fundamental protocol incompatibility. It's like trying to make a video call via email — different things for different tasks.
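The difference is easier to see in code than in prose. A sketch, with model names as illustrative assumptions: the first call is the one-shot HTTPS request that aggregators like OpenRouter proxy, the second is the persistent WebSocket the Realtime API requires.

```python
# One-shot HTTPS (what HTTP-based aggregators proxy) vs. a persistent
# WebSocket (what the Realtime API needs). Model names are illustrative.
import asyncio, os
import requests    # pip install requests
import websockets  # pip install websockets

KEY = os.environ["OPENAI_API_KEY"]

# Chat Completions: connect, send once, receive once, done.
resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {KEY}"},
    json={"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "hi"}]},
)
print(resp.json()["choices"][0]["message"]["content"])

# Realtime: the connection stays open and events flow both ways until hangup.
async def realtime():
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"
    async with websockets.connect(url, extra_headers={"Authorization": f"Bearer {KEY}"}) as ws:
        async for event in ws:  # a continuous stream of JSON events
            ...

# asyncio.run(realtime())
```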
What this means in practice
If you try to connect to GPT-Realtime-2 through OpenRouter — you will simply get a connection error or a 404. The model will not appear there even if OpenRouter adds other new OpenAI models. Realtime API exists in a separate space from Chat Completions and Responses API.
Other aggregators built on the same HTTP architecture will also not work: Together AI, Fireworks AI, Groq (for this specific model), AWS Bedrock in standard mode. Any proxy that does not support WebSocket at the infrastructure level will not work.
What to use instead
For no-code testing:
- OpenAI Playground — there is already an interface for GPT-Realtime-2 with a microphone directly in the browser. The fastest way to hear the model in action without any code.
For development:
- Direct OpenAI Key — the only way to access the Realtime API. If your project already has a key for GPT-4o or GPT-5 — it will work here too. A separate key is not needed.
- WebSocket — the primary connection method for server-side applications and Node.js. More control over the session, suitable for complex agent flows.
- WebRTC — a method for browser applications where audio is captured directly from the user's microphone. Less server infrastructure, better for client-side applications.
- SIP — for integration with telephony. If you are building an agent for real phone calls — this is the official connection method via the SIP protocol.
In short, on choosing a connection method:
- Browser application with a microphone → WebRTC
- Server application / Node.js / Python backend → WebSocket
- Integration with telephony (real calls) → SIP
- Just to see how it works → Playground
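For a taste of what working over WebSocket looks like, here is the shape of the send/receive loop over an open connection `ws`: microphone audio goes up as base64 chunks, model audio comes back as deltas. Event names follow the current Realtime API and may shift for the new models.

```python
# Send/receive loop sketch over an open Realtime WebSocket `ws`.
# Event names follow the current Realtime API (assumed to carry over).
import base64, json

async def send_mic_chunk(ws, pcm16_bytes):
    # Stream raw PCM16 audio up to the model as it is captured.
    await ws.send(json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_bytes).decode(),
    }))

async def play_model_audio(ws, play):  # `play` is your audio-output callback
    async for raw in ws:
        event = json.loads(raw)
        if event["type"] == "response.audio.delta":
            play(base64.b64decode(event["delta"]))  # audio arrives in chunks
        elif event["type"] == "response.done":
            break
```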
In the next technical article, we will delve into connecting GPT-Realtime-2 via WebSocket: how to open a session, how to transfer audio in chunks, how to configure preambles and parallel tool calls — with complete code in JavaScript and Python.
→ GPT-Realtime-2: Technical Guide — WebSocket API, Connection, Code Examples 2026 (article coming soon)
Conclusions: voice agents are no longer a compromise
GPT-Realtime-2 is not just another model update. It's a change in what's possible in voice AI altogether. And to avoid sounding abstract — here's specifically what has changed and for whom it's important.
What this release has truly changed
Before May 7, 2026, a voice agent with real thinking and natural conversation simultaneously was a compromise or an expensive custom solution. Now it's a single API call with a configured effort level.
Specific changes with practical significance:
- The cascaded ASR → LLM → TTS stack is no longer mandatory for complex scenarios. GPT-Realtime-2 replaces it with a single connection — less infrastructure, fewer points of failure.
- 128K context eliminates the need for external state management for most production scenarios.
- Preambles and parallel tool calls solve the UX problem of "intelligent silence" that previously required a separate logic layer.
- Realtime API has exited beta — this is a signal that the infrastructure is stable and ready for production.
- Live translation of 70+ languages via GPT-Realtime-Translate becomes available without building a separate pipeline.
- Streaming transcription via GPT-Realtime-Whisper for $0.017/min — the cheapest option for live captions and notes.
Who it's relevant for right now
If you are building a product with voice support — GPT-Realtime-2 at a low effort level can already replace or significantly simplify your current stack. Results from Zillow (+26 points on an adversarial benchmark) and BolnaAI (-12.5% WER) show that the improvement is real, not just on synthetic tests.
If you are building an international product — GPT-Realtime-Translate removes the language barrier without hiring multilingual operators. $0.034/min for live translation between 70+ languages — this is the new pricing reality in this segment.
If you need real-time transcription — GPT-Realtime-Whisper for $0.017/min is the easiest entry into streaming transcription without building your own ASR pipeline.
If you are still just evaluating — Playground allows you to hear the model in action in five minutes without a single line of code. This is the fastest way to understand if it's suitable for your scenario.
What to do next — step by step
- Test in Playground — platform.openai.com/playground. Talk to the model, evaluate latency and naturalness for your real scenarios.
- Determine the connection method — WebSocket for server applications, WebRTC for browsers, SIP for telephony.
- Start with low effort — and only increase the level where measured quality is insufficient.
- Consider caching — a large, stable system prompt that is cached reduces the cost of input tokens from $32 to $0.40 per 1M.
- Read the technical guide — it contains step-by-step connection with code, WebSocket session configuration, examples of preambles, and troubleshooting of common errors.
The main takeaway: voice agents have ceased to be a niche technology where you have to choose between quality and speed. GPT-Realtime-2 has made "intelligent and fast simultaneously" accessible through a single API with predictable pricing. The question now is not "is it possible to build this" — but "when are you starting."
Read more
Technical article with full connection code, WebSocket session configuration, and examples of preambles and tool calls:
→ GPT-Realtime-2: Technical Guide — WebSocket API, Connection, Code Examples 2026
If you are interested in the broader OpenAI ecosystem for developers — a complete guide to Codex: models, surfaces, CLI, comparison with GitHub Copilot and Claude Code:
→ Codex from OpenAI: Complete Guide 2026
Sources: OpenAI Official Announcement, OpenAI Developer Docs — gpt-realtime-2, Realtime WebSocket Guide, Interesting Engineering, Heyloha Blog