This article is a practical guide for developers who want to connect GPT-Realtime-2 to their project. We will cover the Realtime API architecture, choose the right connection method for your scenario, write the first working session from scratch, and configure preambles, tool calls, and recovery with real code.
If you need context on what this release is about and why it's important, read the news article first: → OpenAI released GPT-Realtime-2: the first voice model with GPT-5 level thinking
In short: GPT-Realtime-2 connects via WebSocket (server), WebRTC (browser), or SIP (telephony). This is not a REST API – it's a persistent bidirectional connection. All code examples in the article are taken from the official OpenAI documentation and verified for relevance as of May 2026.
Realtime API vs Chat Completions API — what's the fundamental difference between the protocols
Before writing code, it's important to understand what you're working with. Realtime API and Chat Completions API are not two versions of the same tool. They are fundamentally different protocols for different tasks.
Chat Completions API — HTTP request-response
Chat Completions API works on the classic HTTP scheme:
- The client sends a POST request with messages
- The server processes and returns a response
- The connection is closed
Even streaming in Chat Completions (where text appears gradually) is implemented via Server-Sent Events (SSE) — technically, it's a unidirectional stream over HTTP, not a bidirectional connection. The client listens, the server writes — but not vice versa simultaneously.
Realtime API — persistent bidirectional WebSocket
Realtime API works via WebSocket — a protocol that establishes a persistent connection between the client and the server. After the handshake, both sides can send data at any time independently of each other:
- The client streams audio chunks while the user is speaking
- The model responds with audio chunks even before the question is finished
- Both streams go simultaneously through one connection
- The connection stays alive for the entire conversation (maximum 60 minutes per session)
Key difference for the developer: with Chat Completions, you send a request and wait. With Realtime API, you maintain a persistent connection and react to events. This is closer to developing a WebSocket chat than a regular API client. If you haven't worked with WebSockets before, allocate time to familiarize yourself with the event-driven model.
Comparison table:
| Characteristic | Chat Completions API | Realtime API |
|---|---|---|
| Protocol | HTTP / SSE | WebSocket / WebRTC / SIP |
| Connection | Request → Response → Closed | Persistent, bidirectional |
| Input | Text, image, audio file | Streaming audio chunk |
| Output | Text (+ optional audio) | Streaming audio + text |
| Billing | Per token | Per audio token (or per min for Translate/Whisper) |
| Aggregators (OpenRouter) | Supported | Not supported |
| Suitable for | Chat, analysis, text generation | Voice agents, live translation, transcription |
Speech-to-speech architecture — how the model processes audio without intermediate steps
Understanding how the model processes audio internally is important for designing the system correctly and understanding why the latency is what it is.
Old approach: cascaded pipeline
Before the advent of end-to-end voice models, the standard architecture looked like this:
Microphone → [ASR] → Text → [LLM] → Text → [TTS] → Audio
             ~300ms         ~2-6s          ~300ms
Each bracketed stage is a separate service or HTTP request, and each one adds latency. Reasoning inside the LLM stretched the second stage to 5–7 seconds. Total latency: roughly 1.5–8 seconds, depending on complexity.
GPT-Realtime-2: end-to-end audio
GPT-Realtime-2 replaces the entire cascade with a single model:
Microphone → [GPT-Realtime-2] → Audio
             ~500-1200ms (first turn)
             ~300-600ms (subsequent turns)
Audio comes in, reasoning happens within the model in the audio domain, and audio comes out. There's no text conversion between steps — this eliminates latency at transitions and allows the model to better understand tone, pauses, and interruptions.
Audio format
Realtime API accepts and returns audio in PCM16 format at 24,000 Hz (or Opus). This is important to know when integrating with a microphone or telephony system — the format needs to be converted before transmission.
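To make the format concrete, here is a minimal Python sketch that converts floating-point samples to 16-bit little-endian PCM and computes how many bytes a chunk of a given duration occupies. The helper names (`float_to_pcm16`, `chunk_size_bytes`) and the mono-channel assumption are illustrative and not part of the API.

```python
import struct

SAMPLE_RATE_HZ = 24_000   # rate stated in the article
BYTES_PER_SAMPLE = 2      # PCM16 = 16-bit signed integers


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack("<%dh" % len(ints), *ints)


def chunk_size_bytes(duration_ms, channels=1):
    """Bytes needed for `duration_ms` of audio (mono is an assumption)."""
    return SAMPLE_RATE_HZ * duration_ms // 1000 * BYTES_PER_SAMPLE * channels
```

Under these assumptions, a 100 ms mono chunk is 4,800 bytes before Base64 encoding, which is a convenient size to reason about when buffering microphone input.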
Event-driven interaction model
The session is managed through client events and server events. The client sends events — the server responds with events. Main types:
Client events (you send):
- session.update: updates the session configuration (instructions, voice, tools)
- input_audio_buffer.append: sends an audio chunk
- input_audio_buffer.commit: signals that the turn is complete
- response.create: requests a response
- conversation.item.create: adds a conversation item, such as a text message or a tool result
Server events (you receive):
- session.created: the session is ready
- session.updated: the update was applied
- response.output_audio.delta: a chunk of the audio response (named response.audio.delta in the older beta interface)
- response.output_audio_transcript.delta: a chunk of the transcript of the spoken response; text-only responses stream response.output_text.delta instead (response.text.delta in the beta interface)
- response.function_call_arguments.delta: a chunk of tool call arguments
- response.done: the response is complete
- error: a session error
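The event-driven model can be sketched in Python as a small router that parses each incoming JSON message and calls the handlers registered for its `type`. `EventRouter` and `make_client_event` are hypothetical helper names introduced for illustration; only the event type strings come from the lists above.

```python
import json
from collections import defaultdict


class EventRouter:
    """Dispatch server events to handlers registered by event type."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def dispatch(self, raw_message):
        """Parse one JSON message; return how many handlers ran."""
        event = json.loads(raw_message)
        handlers = self._handlers.get(event.get("type"), [])
        for handler in handlers:
            handler(event)
        return len(handlers)


def make_client_event(event_type, **fields):
    """Serialize a client event such as response.create."""
    return json.dumps({"type": event_type, **fields})


router = EventRouter()
session_ids = []
router.on("session.created", lambda e: session_ids.append(e["session"]["id"]))
router.dispatch('{"type": "session.created", "session": {"id": "sess_demo"}}')
```

Events with no registered handler are ignored rather than raising, which keeps the client tolerant of event types it does not know about.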
The maximum duration of one Realtime session is 60 minutes. After that, the connection closes, and a new session needs to be opened. Plan reconnect logic if your scenario involves longer interactions.
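One way to plan for the limit is to track session age on the client and rotate to a new session shortly before the hour is up. The sketch below is an assumption about how you might structure that; `SessionClock` and the five-minute safety margin are illustrative choices, not API features.

```python
import time

MAX_SESSION_SECONDS = 60 * 60  # 60-minute limit stated in the article


class SessionClock:
    """Track session age and signal when it is time to open a new one."""

    def __init__(self, margin_seconds=300, now=time.monotonic):
        self._now = now              # injectable clock, useful for testing
        self._margin = margin_seconds
        self._started = now()

    def elapsed(self):
        return self._now() - self._started

    def should_rotate(self):
        """True once the session is within the margin of the limit."""
        return self.elapsed() >= MAX_SESSION_SECONDS - self._margin
```

Checking `should_rotate()` between turns, rather than mid-response, avoids cutting the model off while it is speaking.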
Three connection methods: WebSocket, WebRTC, SIP — which to choose for your scenario
Realtime API supports three connection methods. The choice depends not on preference, but on where the audio is located in your system.
WebSocket — for server-side applications
WebSocket is suitable for server-to-server integrations: Node.js backend, Python service, any server-side application that already has audio or processes it on the server.
Advantages: full control over the session, suitable for complex agent flows, can use a regular API key (ephemeral token not required).
Limitations: you are responsible for capturing and transmitting Base64-encoded audio chunks.
WebRTC — for browser applications
WebRTC is the recommended method for browser and mobile clients where audio is captured directly from the user's microphone. WebRTC is optimized for minimal latency in real-time audio.
For browser connection, an ephemeral token (short-lived key) is required — your main API key cannot be used on the client. An ephemeral token is generated on your backend and passed to the client.
SIP — for telephony
SIP is for integration with telephone systems. If you are building an agent for real phone calls, this is the official protocol. In 2026, OpenAI's native SIP endpoint is still in beta; for production, most teams use SIP via Twilio or Telnyx as a bridge to WebSocket.
Method selection:
- Browser / mobile app → WebRTC (+ ephemeral token from your backend)
- Node.js / Python backend → WebSocket (direct API key)
- Real phone calls → SIP (or SIP via Twilio/Telnyx → WebSocket)
- Deno / Cloudflare Workers → WebSocket via standard browser WebSocket API
Connecting via WebSocket — Step-by-Step Guide with JS and Python Code
Let's explore connecting via WebSocket, the most common method for server applications. All examples are taken from the official OpenAI documentation.
Step 1. Installing Dependencies
Node.js:
npm install ws
Python:
pip install websocket-client
Step 2. Connecting to the Realtime API
JavaScript (Node.js):
import WebSocket from "ws";

const url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2";
const ws = new WebSocket(url, {
  headers: {
    Authorization: "Bearer " + process.env.OPENAI_API_KEY,
    // If your application has user identifiers,
    // pass a hashed user ID for safety monitoring
    "OpenAI-Safety-Identifier": "hashed-user-id",
  },
});

ws.on("open", function open() {
  console.log("Connection established");
});

ws.on("message", function incoming(message) {
  const event = JSON.parse(message.toString());
  console.log("Event from server:", event.type);
});

ws.on("error", function error(err) {
  console.error("WebSocket error:", err);
});

ws.on("close", function close() {
  console.log("Connection closed");
});
Python:
import os
import json
import websocket  # provided by the websocket-client package

# Fail fast with a KeyError if the key is not set
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"
headers = [
    "Authorization: Bearer " + OPENAI_API_KEY,
    "OpenAI-Safety-Identifier: hashed-user-id",
]

def on_open(ws):
    print("Connection established")

def on_message(ws, message):
    data = json.loads(message)
    print("Event:", data["type"])

def on_error(ws, error):
    print("Error:", error)

def on_close(ws, close_status_code, close_msg):
    print("Connection closed")

ws = websocket.WebSocketApp(
    url,
    header=headers,
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close,
)
ws.run_forever()
Step 3. Configuring the Session via session.update
Immediately after receiving the session.created event, send session.update with the configuration:
// JavaScript
ws.on("message", function incoming(message) {
  const event = JSON.parse(message.toString());
  if (event.type !== "session.created") return;

  // The session is ready: send the configuration
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      type: "realtime",
      model: "gpt-realtime-2",
      output_modalities: ["audio"],
      audio: {
        input: {
          // Audio format: PCM16 at 24 kHz
          format: {
            type: "audio/pcm",
            rate: 24000,
          },
          // Semantic VAD: the model itself detects the end of the phrase
          turn_detection: {
            type: "semantic_vad",
          },
        },
        output: {
          format: {
            type: "audio/pcm",
          },
          // Voice: alloy, echo, shimmer, cedar, marin
          voice: "marin",
        },
      },
      instructions: "You are a helpful assistant. Respond concisely and to the point.",
    },
  }));
});
Important: the voice can only be changed while the model has not yet sent any audio in the session. After the first audio response, the voice is fixed for the rest of the session, so choose it in advance.
First Working Session: Open Connection, Send Audio, Get Response
Now let's assemble the full cycle: connection → configuration → audio transmission → response reception.
WebSocket Session Event Flow
Here's the sequence of events for a typical voice turn:
Client                                    Server
|                                              |
|--- WebSocket connect ----------------------->|
|<-- session.created --------------------------|
|                                              |
|--- session.update -------------------------->|
|<-- session.updated --------------------------|
|                                              |
|--- input_audio_buffer.append --------------->| (repeated for each chunk)
|--- input_audio_buffer.commit --------------->| (manual mode; with semantic_vad the server commits the buffer itself)
|--- response.create ------------------------->| (manual mode; with semantic_vad the response starts automatically)
|                                              |
|<-- response.output_audio.delta --------------| (repeated for each audio chunk)
|<-- response.output_audio_transcript.delta ---| (transcript of the spoken response)
|<-- response.done ----------------------------|
Sending Audio in Chunks
Audio is sent as Base64-encoded chunks via input_audio_buffer.append. With semantic_vad, the model itself determines when the user has finished speaking and starts a response. With manual control, you send input_audio_buffer.commit and response.create yourself:
// Sending an audio chunk (Base64-encoded PCM16)
function sendAudioChunk(audioBuffer) {
  const base64Audio = Buffer.from(audioBuffer).toString("base64");
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: base64Audio,
  }));
}

// Manually ending the turn (if not using semantic_vad)
function commitAudio() {
  ws.send(JSON.stringify({
    type: "input_audio_buffer.commit",
  }));
  // Requesting a response
  ws.send(JSON.stringify({
    type: "response.create",
  }));
}
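For a Python backend, the chunking step can be sketched as a generator that slices a PCM16 buffer into fixed-duration pieces and wraps each one in an `input_audio_buffer.append` event. The function name `iter_append_events` and the 100 ms chunk size are assumptions made for this example.

```python
import base64
import json


def iter_append_events(pcm_bytes, chunk_ms=100, sample_rate=24_000):
    """Yield JSON input_audio_buffer.append events for a mono PCM16 buffer."""
    chunk_bytes = sample_rate * chunk_ms // 1000 * 2  # 2 bytes per sample
    for start in range(0, len(pcm_bytes), chunk_bytes):
        chunk = pcm_bytes[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


# Example: 12,800 bytes of placeholder audio become three events
# (4,800 + 4,800 + 3,200 bytes before encoding).
demo_pcm = bytes(range(256)) * 50
demo_events = list(iter_append_events(demo_pcm))
```

Each yielded string can be passed directly to the WebSocket `send` call; the last chunk is simply shorter when the buffer length is not a multiple of the chunk size.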
Handling Audio Response
ws.on("message", function incoming(message) {
  const event = JSON.parse(message.toString());
  switch (event.type) {
    case "session.created":
      console.log("Session ready:", event.session.id);
      // Here we send session.update
      break;
    case "response.output_audio.delta":
    case "response.audio.delta": { // name used by the older beta interface
      // Received a Base64-encoded audio chunk: decode and play it
      const audioChunk = Buffer.from(event.delta, "base64");
      playAudio(audioChunk); // your playback function
      break;
    }
    case "response.output_audio_transcript.delta":
    case "response.output_text.delta":
    case "response.text.delta": // beta name for text output
      // Transcript of the spoken response (or text output in text-only mode)
      process.stdout.write(event.delta);
      break;
    case "response.done":
      console.log("\nResponse finished");
      break;
    case "error":
      console.error("Session error:", event.error);
      break;
  }
});
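If you want to inspect the audio rather than play it live, a Python backend can accumulate the decoded deltas and write them out as a WAV file. This is a sketch under assumptions: `AudioCollector` is a hypothetical helper, the output is taken to be mono PCM16 at 24 kHz, and the set of event names covers both the newer and the older naming, so check it against the API version you target.

```python
import base64
import io
import wave

# Audio delta event names; which one you receive depends on the API version.
AUDIO_DELTA_TYPES = {"response.output_audio.delta", "response.audio.delta"}


class AudioCollector:
    """Accumulate Base64 audio deltas and export them as WAV bytes."""

    def __init__(self):
        self._pcm = bytearray()

    def handle(self, event):
        if event.get("type") in AUDIO_DELTA_TYPES:
            self._pcm.extend(base64.b64decode(event["delta"]))

    def to_wav_bytes(self, sample_rate=24_000):
        buffer = io.BytesIO()
        with wave.open(buffer, "wb") as wav:
            wav.setnchannels(1)      # mono (assumption)
            wav.setsampwidth(2)      # 16-bit samples
            wav.setframerate(sample_rate)
            wav.writeframes(bytes(self._pcm))
        return buffer.getvalue()
```

Calling `handle` for every incoming event and `to_wav_bytes` after `response.done` gives one playable file per response.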
Preambles, Parallel Tool Calls, and Recovery — Configuration and Code Examples
Preambles — How to Set Up Phrases During Thinking
Preambles are short phrases that the model speaks while reasoning is happening in the background. They are enabled in session.update via instructions:
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    type: "realtime",
    model: "gpt-realtime-2",
    // Tool names here must match the names registered in `tools`
    instructions: `You are a helpful assistant.
Before each response that requires searching or calculation,
say a short phrase: 'one moment', 'let me check',
'just a second' or similar. This will let the user know you are working.
If calling check_calendar, say 'checking your calendar'.
If calling get_order_status, say 'looking up your order'.`,
  },
}));
Preambles are not a separate API parameter, but an instruction to the model in the system prompt. The GPT-Realtime-2 model follows these instructions and generates preambles contextually — depending on which tool it is about to call.
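Because preambles live in the prompt, it can help to generate that part of the instructions from the same place where tools are defined, so the phrases and tool names do not drift apart. The sketch below assumes a hypothetical helper `build_preamble_instructions`; the tool names in the example are the ones registered in the tool call section of this article.

```python
def build_preamble_instructions(base_prompt, tool_phrases):
    """Compose system instructions with one preamble phrase per tool."""
    lines = [
        base_prompt.strip(),
        "Before any response that requires searching or calculation, "
        "say a short phrase such as 'one moment' or 'let me check'.",
    ]
    for tool_name, phrase in tool_phrases.items():
        lines.append(f"If calling {tool_name}, say '{phrase}'.")
    return "\n".join(lines)


instructions = build_preamble_instructions(
    "You are a helpful assistant.",
    {
        "check_calendar": "checking your calendar",
        "get_order_status": "looking up your order",
    },
)
```

The resulting string goes into the `instructions` field of `session.update`, exactly like a hand-written prompt.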
Tool Calls — Registration and Handling
Tools are registered in session.update and work the same way as function calling in Chat Completions — but asynchronously through events:
// Registering tools in the session
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    type: "realtime",
    model: "gpt-realtime-2",
    tools: [
      {
        type: "function",
        name: "get_order_status",
        description: "Get order status by its number",
        parameters: {
          type: "object",
          properties: {
            order_id: {
              type: "string",
              description: "Order number, e.g., ORD-12345",
            },
          },
          required: ["order_id"],
        },
      },
      {
        type: "function",
        name: "check_calendar",
        description: "Check available slots in the calendar",
        parameters: {
          type: "object",
          properties: {
            date: {
              type: "string",
              description: "Date in YYYY-MM-DD format",
            },
          },
          required: ["date"],
        },
      },
    ],
    tool_choice: "auto", // the model decides when to call tools
  },
}));
// Handling tool calls from the server.
// The callback is async because it awaits the tool result.
ws.on("message", async function incoming(message) {
  const event = JSON.parse(message.toString());
  if (event.type === "response.function_call_arguments.done") {
    const toolName = event.name;
    const args = JSON.parse(event.arguments);
    // Executing the tool call on your backend (executeToolCall is your own function)
    const result = await executeToolCall(toolName, args);
    // Returning the result to the session
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: event.call_id,
        output: JSON.stringify(result),
      },
    }));
    // Asking the model to continue the conversation with the result
    ws.send(JSON.stringify({
      type: "response.create",
    }));
  }
});
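The same round trip can be sketched in Python as a pure function that takes a completed tool call event and returns the two client events to send back. `handle_tool_call` and the `registry` mapping are hypothetical names for this example; the event fields (`name`, `arguments`, `call_id`) follow the JavaScript handler above. Reporting failures to the model as an `error` field is a design choice of this sketch, not documented API behavior.

```python
import json


def handle_tool_call(event, registry):
    """Run one tool call and build the client events that answer it."""
    try:
        tool = registry.get(event["name"])
        if tool is None:
            raise ValueError(f"unknown tool: {event['name']}")
        result = tool(**json.loads(event["arguments"]))
    except Exception as exc:  # surface the failure to the model, do not crash
        result = {"error": str(exc)}
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": json.dumps(result),
            },
        },
        {"type": "response.create"},
    ]


def get_order_status(order_id):
    # Placeholder implementation; a real one would query your order system.
    return {"order_id": order_id, "status": "shipped"}


registry = {"get_order_status": get_order_status}
```

Keeping the function free of network calls makes it easy to unit test; the caller is responsible for serializing each returned event and sending it over the socket in order.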
Recovery — Handling Session Errors
GPT-Realtime-2 handles errors natively and speaks them to the user. However, at the code level, you should also handle server-side error events:
ws.on("message", function incoming(message) {
  const event = JSON.parse(message.toString());
  if (event.type === "error") {
    console.error("Session error:", event.error.code, event.error.message);
    // reconnect() and openNewSession() are your own functions
    // If it's a rate limit error, wait and reconnect
    if (event.error.code === "rate_limit_exceeded") {
      setTimeout(() => reconnect(), 5000);
    }
    // If the session has expired, open a new one
    if (event.error.code === "session_expired") {
      openNewSession();
    }
  }
});
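A fixed five-second wait works for a demo, but repeated failures are usually handled with exponential backoff and jitter so that many clients do not reconnect at the same instant. The sketch below shows that technique in Python; `backoff_delays`, the one-second base, and the 30-second cap are assumptions chosen for illustration.

```python
import random


def backoff_delays(attempts, base=1.0, cap=30.0, rng=None):
    """Return reconnect delays using exponential backoff with full jitter."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)   # 1, 2, 4, ... capped at `cap`
        delays.append(rng.uniform(0, ceiling))    # pick a random point below it
    return delays
```

A reconnect loop would sleep for each delay in turn and stop as soon as a connection attempt succeeds.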
Reasoning effort and context window — how to balance quality, speed, and cost
Reasoning effort: five levels
GPT-Realtime-2 supports five levels of reasoning effort configurable via session.update:
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    type: "realtime",
    model: "gpt-realtime-2",
    // Available values: "minimal", "low", "medium", "high", "xhigh"
    // Default: "low"
    reasoning_effort: "medium",
  },
}));
Practical level selection:
| Level | Latency | Cost | When to use |
|---|---|---|---|
| minimal | ~300ms | minimal | Simple commands, navigation, yes/no answers |
| low (default) | ~500ms | low | FAQ, order confirmations, standard support |
| medium | ~800ms | medium | Multi-step scenarios, multiple tool calls |
| high | ~1.5s | high | Complex agent flows, compliance-sensitive tasks |
| xhigh | ~2-3s | highest | Maximum accuracy, critical decisions |
Practical tip: OpenAI's marketing benchmarks (+15.2% Big Bench Audio, +13.8% Audio MultiChallenge) were measured at the high / xhigh level. With the default low, the results are different. Start with low, measure quality on real scenarios, and only increase effort where there's an objective problem with response quality.
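If different parts of your product have different latency budgets, one option is to cap the requested effort by the budget. The sketch below uses the approximate latencies from the table above (taking the upper bound where a range is given); `cap_effort` is a hypothetical helper, and the numbers are the article's estimates rather than guarantees.

```python
LEVELS = ["minimal", "low", "medium", "high", "xhigh"]

# Approximate latency per level in milliseconds, copied from the table above.
APPROX_LATENCY_MS = {
    "minimal": 300,
    "low": 500,
    "medium": 800,
    "high": 1500,
    "xhigh": 3000,
}


def cap_effort(requested, budget_ms):
    """Return `requested`, or the nearest lower level that fits the budget.

    Falls back to "minimal" when nothing fits.
    """
    index = LEVELS.index(requested)  # raises ValueError for unknown levels
    while index > 0 and APPROX_LATENCY_MS[LEVELS[index]] > budget_ms:
        index -= 1
    return LEVELS[index]
```

For example, a flow that asks for `xhigh` but can tolerate only about one second of added latency would be capped at `medium`.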
Context window 128K — what it means in practice
128K tokens of context is approximately:
- ~60–90 minutes of voice dialogue (depending on speech density)
- A full session with 10–15 tool calls and their results
- A detailed system prompt + all customer history + current conversation
Important: audio tokens cost significantly more than text tokens, and the total number of tokens in the session (including audio) must stay within the 128K limit. Very long sessions therefore need compaction logic, that is, context compression:
// Check token usage on every response.done event. When the session
// approaches the limit, you can summarize the context (for example via
// the Responses API) and start a new session with that compact summary.
ws.on("message", function incoming(message) {
  const event = JSON.parse(message.toString());
  if (event.type === "response.done") {
    const usage = event.response.usage;
    console.log("Tokens used:", usage.total_tokens);
    if (usage.total_tokens > 100000) {
      // Approaching the limit: plan for compaction
      scheduleContextCompaction(); // your own function
    }
  }
});
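The hand-off to a new session can be sketched in Python as two small helpers: one that decides when compaction is due, and one that folds a summary of the old conversation into the instructions for the new session. Both function names and the 80% threshold are assumptions for this example; producing the summary itself is left to your own code.

```python
CONTEXT_LIMIT_TOKENS = 128_000  # context size stated in the article


def needs_compaction(response_done_event, threshold=0.8):
    """True when reported usage reaches `threshold` of the context limit."""
    usage = response_done_event.get("response", {}).get("usage") or {}
    return usage.get("total_tokens", 0) >= CONTEXT_LIMIT_TOKENS * threshold


def carry_over_instructions(base_instructions, summary):
    """Instructions for the next session, seeded with a conversation summary."""
    return (
        f"{base_instructions}\n\n"
        f"Summary of the earlier conversation:\n{summary}"
    )
```

The string returned by `carry_over_instructions` would be sent as `instructions` in the first `session.update` of the new session.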
GPT-Realtime-Translate and GPT-Realtime-Whisper — connection and use cases
GPT-Realtime-Translate — live translation
Connects via the same Realtime API, but with a different session type. The official documentation recommends WebRTC for browser media and WebSocket for server-side media pipelines.
Key difference in configuration: for Translate, you don't need to call response.create — translation starts automatically as soon as there is audio.
// Session configuration for live translation
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    type: "translation", // Session type: translation
    model: "gpt-realtime-translate",
    translation: {
      source_language: "uk", // input speech language
      target_language: "en", // output language
    },
    // Do not call response.create: translation is automatic
  },
}));
Supported output languages (13): English, Spanish, French, German, Portuguese, Japanese, Korean, Chinese (Simplified), Arabic, Hindi, Italian, Dutch, Polish.
Input languages — 70+, including Ukrainian, Russian, Turkish, Hindi, Tamil, Telugu, and others.
GPT-Realtime-Whisper — streaming transcription
For transcription, a separate session type transcription is used:
// Session configuration for streaming transcription
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    type: "transcription", // Session type: transcription
    model: "gpt-realtime-whisper",
    transcription: {
      language: "uk", // transcription language (optional, auto-detect if not specified)
      // Latency adjustment:
      // low value = faster partial transcripts
      // high value = better quality
      delay: "low", // "low" | "medium" | "high"
    },
  },
}));

// Receiving transcripts
ws.on("message", function incoming(message) {
  const event = JSON.parse(message.toString());
  if (event.type === "conversation.item.input_audio_transcription.delta") {
    // Partial transcript: appears in real time
    process.stdout.write(event.delta);
  }
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    // Final transcript of the turn
    console.log("\nFinal:", event.transcript);
  }
});
Whisper ≠ GPT-Realtime-Whisper: the classic whisper-1 via Speech-to-Text API is batch transcription of a finished audio file after recording. GPT-Realtime-Whisper is streaming, word by word, while the person is still speaking. Different endpoints, different model, different scenario. Do not substitute one for the other.
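On the receiving side, streaming transcription means keeping a running partial transcript per turn and replacing it with the final text when the turn completes. The Python sketch below illustrates this with a hypothetical `TranscriptBuffer` class; it assumes each event carries an `item_id` identifying the turn, which you should confirm against the events you actually receive.

```python
from collections import defaultdict


class TranscriptBuffer:
    """Collect partial transcripts per item and keep finished ones."""

    def __init__(self):
        self._partial = defaultdict(str)
        self.final = []

    def handle(self, event):
        kind = event.get("type", "")
        item_id = event.get("item_id", "")
        if kind.endswith("input_audio_transcription.delta"):
            self._partial[item_id] += event["delta"]
        elif kind.endswith("input_audio_transcription.completed"):
            self._partial.pop(item_id, None)   # the final text supersedes it
            self.final.append(event["transcript"])

    def partial(self, item_id):
        return self._partial.get(item_id, "")
```

A UI would render `partial(item_id)` while the person is speaking and switch to the entries in `final` once each turn is complete.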
Common mistakes and how to avoid them
❌ Attempting to connect via OpenRouter or another aggregator
OpenRouter and similar aggregators are built on HTTP request-response infrastructure, while the Realtime API requires a persistent WebSocket connection. This is an architectural incompatibility that no setting can resolve. The only solution is a direct OpenAI API key.
❌ Using API key on the client (browser) with WebSocket
When connecting via WebSocket from a browser, do not pass the main API key — it will be visible in the code. For browser clients, use WebRTC with an ephemeral token generated on your backend:
// On your backend: generate an ephemeral token
const response = await fetch("https://api.openai.com/v1/realtime/client_secrets", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    session: {
      type: "realtime",
      model: "gpt-realtime-2",
      audio: { output: { voice: "marin" } },
    },
  }),
});
const data = await response.json();
const ephemeralKey = data.value; // pass to client
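If your backend is written in Python, the same request can be prepared with the standard library. The sketch below only builds the request object and does not send it; `build_client_secret_request` is a hypothetical helper, and the endpoint, body shape, and `value` field are taken from the JavaScript example above rather than verified independently.

```python
import json
import urllib.request

CLIENT_SECRETS_URL = "https://api.openai.com/v1/realtime/client_secrets"


def build_client_secret_request(api_key, model="gpt-realtime-2", voice="marin"):
    """Prepare the POST request that asks for an ephemeral client token."""
    body = {
        "session": {
            "type": "realtime",
            "model": model,
            "audio": {"output": {"voice": voice}},
        }
    }
    return urllib.request.Request(
        CLIENT_SECRETS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the request to `urllib.request.urlopen`, parse the JSON response, and hand the `value` field to the browser client. The main API key never leaves the server.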
❌ Not handling reconnect on session_expired
Session maximum is 60 minutes. Without reconnect logic, your agent will simply freeze after an hour. Implement automatic reconnection with the transfer of a shortened context of the previous conversation.
❌ Setting xhigh effort for all requests
xhigh = maximum quality, but also maximum latency (~2-3s) and maximum output tokens. For an FAQ agent with thousands of calls per month, the cost difference between low and xhigh can be orders of magnitude. Start with low, measure, and raise the effort only for the flows that need it.
❌ Ignoring audio format
Realtime API expects PCM16 at 24,000 Hz. If your microphone or phone system provides a different format, convert it before sending. Incorrect format will result in either silence or distorted speech without an explicit API error.
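As an illustration of the conversion step, the sketch below resamples mono PCM16 audio to 24 kHz with plain linear interpolation, for example from the 8 kHz typical of telephony. `resample_pcm16` is a hypothetical helper; linear interpolation is the simplest possible method and is shown only to make the idea concrete, so a production system would normally use a proper resampling library with filtering.

```python
import struct


def resample_pcm16(pcm_bytes, src_rate, dst_rate=24_000):
    """Naive linear-interpolation resampler for mono little-endian PCM16."""
    count = len(pcm_bytes) // 2
    if count == 0:
        return b""
    src = struct.unpack("<%dh" % count, pcm_bytes[:count * 2])
    out_len = max(1, count * dst_rate // src_rate)
    out = []
    for i in range(out_len):
        pos = i * src_rate / dst_rate          # position in the source signal
        left = int(pos)
        right = min(left + 1, count - 1)
        frac = pos - left
        out.append(int(round(src[left] * (1 - frac) + src[right] * frac)))
    return struct.pack("<%dh" % out_len, *out)
```

Going from 8 kHz to 24 kHz triples the number of samples, so 8 input samples become 24 output samples.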
❌ Changing voice after the first audio response
The voice is fixed after the model's first audio response. Attempting to change voice via session.update afterwards is ignored without error. If you need to change the voice, open a new session.
FAQ
Do I need a separate API key for Realtime API?
No. If you already have a key for GPT-4o or GPT-5, it will work for Realtime API as well. The same key, the same account on platform.openai.com.
How to test GPT-Realtime-2 without code?
Through the OpenAI Playground — it has an interface for Realtime models with a microphone directly in the browser. The fastest way to hear the model before writing code.
Can I use GPT-Realtime-2 via OpenRouter?
No. OpenRouter is built on HTTP request-response infrastructure, while the Realtime API requires a persistent WebSocket connection. The two are architecturally incompatible.
What is the maximum session duration?
60 minutes. After that, the server closes the connection. Implement reconnect logic for longer interactions.
Is Ukrainian language supported?
Yes — for GPT-Realtime-2 (understands and responds), as an input language for GPT-Realtime-Translate (70+ input languages include Ukrainian), and for GPT-Realtime-Whisper (transcription).
What is semantic_vad and when to use it?
Voice Activity Detection — a mechanism that determines the end of a user's phrase. semantic_vad uses context understanding to avoid cutting off sentences mid-way. Recommended for most scenarios — better than classic silence-based VAD which cuts sentences at natural pauses.
How are tokens counted for audio?
1 second of audio ≈ 40 tokens. At a cost of $32/1M input tokens, 1 minute of input audio costs approximately $0.077. Detailed cost calculation — in the official documentation.
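The arithmetic behind that estimate can be written out as a short Python function. The token rate and price are the figures quoted in this article and are passed as parameters, so you can substitute the numbers from the current pricing page; `input_audio_cost_usd` is a hypothetical helper name.

```python
TOKENS_PER_AUDIO_SECOND = 40         # the article's estimate
INPUT_PRICE_PER_MILLION_USD = 32.0   # the article's quoted input price


def input_audio_cost_usd(
    seconds,
    tokens_per_second=TOKENS_PER_AUDIO_SECOND,
    price_per_million=INPUT_PRICE_PER_MILLION_USD,
):
    """Estimated cost of `seconds` of input audio in US dollars."""
    tokens = seconds * tokens_per_second
    return tokens * price_per_million / 1_000_000


one_minute = input_audio_cost_usd(60)   # 2,400 tokens at $32 per million
```

With these inputs, one minute comes to 2,400 tokens, which matches the per-minute figure above after rounding.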
Conclusions
GPT-Realtime-2 is a different type of integration compared to the Chat Completions API. Event-driven model, persistent WebSocket connection, Base64-encoded audio chunks — if you haven't worked with this stack before, factor in time for familiarization.
But architecturally, it all comes down to a few key things:
- Choose the right method: WebSocket (server), WebRTC (browser), SIP (telephony)
- Open connection → receive session.created → send session.update
- Stream audio in chunks via input_audio_buffer.append
- Respond to response.output_audio.delta (response.audio.delta in the beta interface) and play audio
- Handle tool calls and return results via conversation.item.create
Start with the Playground for initial testing, then take the WebSocket example from this article and adapt it to your stack. Reasoning effort — start with low.
Context about the release itself, real-world cases, and model comparisons — in the news article:
→ OpenAI released GPT-Realtime-2: the first voice model with GPT-5 level thinking
If you are interested in the broader OpenAI ecosystem for developers:
→ Codex from OpenAI: A Complete Guide 2026
Sources: OpenAI Realtime API Overview, WebSocket Guide, WebRTC Guide, SIP Guide, Managing Conversations, Managing Costs