GPT-Realtime-2 WebSocket API: How to Connect, Configure, and Build Voice Agents in 2026

Updated:
Ask AI about this article
GPT-Realtime-2 WebSocket API: How to Connect,  Configure, and Build Voice Agents in 2026

This article is a practical guide for developers who want to connect GPT-Realtime-2 to their project. We will cover the Realtime API architecture, choose the right connection method for your scenario, write the first working session from scratch, and configure preambles, tool calls, and recovery with real code.

If you need context on what this release is about and why it's important, read the news article first: OpenAI released GPT-Realtime-2: the first voice model with GPT-5 level thinking

In short: GPT-Realtime-2 connects via WebSocket (server), WebRTC (browser), or SIP (telephony). This is not a REST API – it's a persistent bidirectional connection. All code examples in the article are taken from the official OpenAI documentation and verified for relevance as of May 2026.

Article Contents

Realtime API vs Chat Completions API — what's the fundamental difference between the protocols

Before writing code, it's important to understand what you're working with. Realtime API and Chat Completions API are not two versions of the same tool. They are fundamentally different protocols for different tasks.

Chat Completions API — HTTP request-response

Chat Completions API works on the classic HTTP scheme:

  1. The client sends a POST request with messages
  2. The server processes and returns a response
  3. The connection is closed

Even streaming in Chat Completions (where text appears gradually) is implemented via Server-Sent Events (SSE) — technically, it's a unidirectional stream over HTTP, not a bidirectional connection. The client listens, the server writes — but not vice versa simultaneously.

Realtime API — persistent bidirectional WebSocket

Realtime API works via WebSocket — a protocol that establishes a persistent connection between the client and the server. After the handshake, both sides can send data at any time independently of each other:

  • The client streams audio chunks while the user is speaking
  • The model responds with audio chunks even before the question is finished
  • Both streams go simultaneously through one connection
  • The connection stays alive for the entire conversation (maximum 60 minutes per session)
Key difference for the developer: with Chat Completions, you send a request and wait. With Realtime API, you maintain a persistent connection and react to events. This is closer to developing a WebSocket chat than a regular API client. If you haven't worked with WebSockets before, allocate time to familiarize yourself with the event-driven model.

Comparison table:

Characteristic Chat Completions API Realtime API
Protocol HTTP / SSE WebSocket / WebRTC / SIP
Connection Request → Response → Closed Persistent, bidirectional
Input Text, image, audio file Streaming audio chunk
Output Text (+ optional audio) Streaming audio + text
Billing Per token Per audio token (or per min for Translate/Whisper)
Aggregators (OpenRouter) Supported Not supported
Suitable for Chat, analysis, text generation Voice agents, live translation, transcription

Speech-to-speech architecture — how the model processes audio without intermediate steps

Understanding how the model processes audio internally is important for designing the system correctly and understanding why the latency is what it is.

Old approach: cascaded pipeline

Before the advent of end-to-end voice models, the standard architecture looked like this:

Microphone → [ASR] → Text → [LLM] → Text → [TTS] → Audio
           ~300ms          ~2-6s          ~300ms

Each arrow is a separate HTTP request or service. Each step adds latency. Reasoning within the LLM increased the second step to 5–7 seconds. Total latency: 1.5–8 seconds depending on complexity.

GPT-Realtime-2: end-to-end audio

GPT-Realtime-2 replaces the entire cascade with a single model:

Microphone → [GPT-Realtime-2] → Audio
                ~500-1200ms (first turn)
                ~300-600ms  (subsequent turns)

Audio comes in, reasoning happens within the model in the audio domain, and audio comes out. There's no text conversion between steps — this eliminates latency at transitions and allows the model to better understand tone, pauses, and interruptions.

Audio format

Realtime API accepts and returns audio in PCM16 format at 24,000 Hz (or Opus). This is important to know when integrating with a microphone or telephony system — the format needs to be converted before transmission.

Event-driven interaction model

The session is managed through client events and server events. The client sends events — the server responds with events. Main types:

Client events (you send):

  • session.update — session configuration update (instructions, voice, tools)
  • input_audio_buffer.append — sending an audio chunk
  • input_audio_buffer.commit — signal that the turn is complete
  • response.create — request to generate a response
  • conversation.item.create — adding a text message

Server events (you receive):

  • session.created — session is ready
  • session.updated — confirmation of update
  • response.audio.delta — audio response chunk
  • response.text.delta — text transcription chunk
  • response.function_call_arguments.delta — tool call arguments
  • response.done — response complete
  • error — session error
The maximum duration of one Realtime session is 60 minutes. After that, the connection closes, and a new session needs to be opened. Plan reconnect logic if your scenario involves longer interactions.

Three connection methods: WebSocket, WebRTC, SIP — which to choose for your scenario

Realtime API supports three connection methods. The choice depends not on preference, but on where the audio is located in your system.

WebSocket — for server-side applications

WebSocket is suitable for server-to-server integrations: Node.js backend, Python service, any server-side application that already has audio or processes it on the server.

Advantages: full control over the session, suitable for complex agent flows, can use a regular API key (ephemeral token not required).

Limitations: you are responsible for capturing and transmitting Base64-encoded audio chunks.

WebRTC — for browser applications

WebRTC is the recommended method for browser and mobile clients where audio is captured directly from the user's microphone. WebRTC is optimized for minimal latency in real-time audio.

For browser connection, an ephemeral token (short-lived key) is required — your main API key cannot be used on the client. An ephemeral token is generated on your backend and passed to the client.

SIP — for telephony

SIP is for integration with telephone systems. If you are building an agent for real phone calls, this is the official protocol. In 2026, OpenAI's native SIP endpoint is still in beta; for production, most teams use SIP via Twilio or Telnyx as a bridge to WebSocket.

Method selection table:

Browser / mobile app → WebRTC (+ ephemeral token from your backend)
Node.js / Python backend → WebSocket (direct API key)
Real phone calls → SIP (or SIP via Twilio/Telnyx → WebSocket)
Deno / Cloudflare Workers → WebSocket via standard browser WebSocket API

Connecting via WebSocket — Step-by-Step Guide with JS and Python Code

Let's explore connecting via WebSocket, the most common method for server applications. All examples are taken from the official OpenAI documentation.

Step 1. Installing Dependencies

Node.js:

npm install ws

Python:

pip install websocket-client

Step 2. Connecting to the Realtime API

JavaScript (Node.js):

import WebSocket from "ws";

const url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2";

const ws = new WebSocket(url, {
  headers: {
    Authorization: "Bearer " + process.env.OPENAI_API_KEY,
    // If your application has user identifiers —
    // pass a hashed user ID for safety monitoring
    "OpenAI-Safety-Identifier": "hashed-user-id",
  },
});

ws.on("open", function open() {
  console.log("Connection established");
});

ws.on("message", function incoming(message) {
  const event = JSON.parse(message.toString());
  console.log("Event from server:", event.type);
});

ws.on("error", function error(err) {
  console.error("WebSocket error:", err);
});

ws.on("close", function close() {
  console.log("Connection closed");
});

Python:

import os
import json
import websocket

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"

headers = [
    "Authorization: Bearer " + OPENAI_API_KEY,
    "OpenAI-Safety-Identifier: hashed-user-id",
]

def on_open(ws):
    print("Connection established")

def on_message(ws, message):
    data = json.loads(message)
    print("Event:", data["type"])

def on_error(ws, error):
    print("Error:", error)

def on_close(ws, close_status_code, close_msg):
    print("Connection closed")

ws = websocket.WebSocketApp(
    url,
    header=headers,
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close,
)

ws.run_forever()

Step 3. Configuring the Session via session.update

Immediately after receiving the session.created event — send session.update with the configuration:

// JavaScript
ws.on("open", function open() {
  // Sending session configuration
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      type: "realtime",
      model: "gpt-realtime-2",
      // Audio format: PCM16 24kHz
      output_modalities: ["audio"],
      audio: {
        input: {
          format: {
            type: "audio/pcm",
            rate: 24000,
          },
          // Semantic VAD — the model itself detects the end of the phrase
          turn_detection: {
            type: "semantic_vad"
          }
        },
        output: {
          format: {
            type: "audio/pcm",
          },
          // Voice: alloy, echo, shimmer, cedar, marin
          voice: "marin",
        }
      },
      instructions: "You are a helpful assistant. Respond concisely and to the point.",
    }
  }));
});
Important about voice: The voice can only be changed if the model has not yet sent any audio responses in this session. After the first audio response — voice becomes immutable for the entire session. Plan your voice selection in advance.

First Working Session: Open Connection, Send Audio, Get Response

Now let's assemble the full cycle: connection → configuration → audio transmission → response reception.

WebSocket Session Event Flow

Here's the sequence of events for a typical voice turn:

Client                          Server
  |                                |
  |--- WebSocket connect --------->|
  |<-- session.created ------------|
  |                                |
  |--- session.update ------------>|
  |<-- session.updated ------------|
  |                                |
  |--- input_audio_buffer.append ->| (repeated for each chunk)
  |--- input_audio_buffer.commit ->| (or semantic_vad will do it automatically)
  |--- response.create ----------->|
  |                                |
  |<-- response.audio.delta -------| (repeated — each audio chunk)
  |<-- response.text.delta --------| (response transcription)
  |<-- response.done --------------|

Sending Audio in Chunks

Audio is sent as Base64-encoded chunks via input_audio_buffer.append. When using semantic_vad — the model itself determines when the user has finished speaking and initiates a response. With manual control — you send input_audio_buffer.commit yourself:

// Sending an audio chunk (Base64-encoded PCM16)
function sendAudioChunk(audioBuffer) {
  const base64Audio = Buffer.from(audioBuffer).toString("base64");

  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: base64Audio,
  }));
}

// Manually ending the turn (if not using semantic_vad)
function commitAudio() {
  ws.send(JSON.stringify({
    type: "input_audio_buffer.commit",
  }));

  // Requesting a response
  ws.send(JSON.stringify({
    type: "response.create",
  }));
}

Handling Audio Response

ws.on("message", function incoming(message) {
  const event = JSON.parse(message.toString());

  switch (event.type) {
    case "session.created":
      console.log("Session ready:", event.session.id);
      // Here we send session.update
      break;

    case "response.audio.delta":
      // Received an audio chunk — play it
      const audioChunk = Buffer.from(event.delta, "base64");
      playAudio(audioChunk); // your playback function
      break;

    case "response.text.delta":
      // Text transcription of the response
      process.stdout.write(event.delta);
      break;

    case "response.done":
      console.log("\nResponse finished");
      break;

    case "error":
      console.error("Session error:", event.error);
      break;
  }
});

Preambles, Parallel Tool Calls, and Recovery — Configuration and Code Examples

Preambles — How to Set Up Phrases During Thinking

Preambles are short phrases that the model speaks while reasoning is happening in the background. They are enabled in session.update via instructions:

ws.send(JSON.stringify({
  type: "session.update",
  session: {
    type: "realtime",
    model: "gpt-realtime-2",
    instructions: `You are a helpful assistant.

Before each response that requires searching or calculation —
say a short phrase: 'one moment', 'let me check',
'just a second' or similar. This will let the user know you are working.

If calling calendar_tool — say 'checking your calendar'.
If calling order_tool — say 'looking up your order'.`,
  }
}));
Preambles are not a separate API parameter, but an instruction to the model in the system prompt. The GPT-Realtime-2 model follows these instructions and generates preambles contextually — depending on which tool it is about to call.

Tool Calls — Registration and Handling

Tools are registered in session.update and work the same way as function calling in Chat Completions — but asynchronously through events:

// Registering tools in the session
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    type: "realtime",
    model: "gpt-realtime-2",
    tools: [
      {
        type: "function",
        name: "get_order_status",
        description: "Get order status by its number",
        parameters: {
          type: "object",
          properties: {
            order_id: {
              type: "string",
              description: "Order number, e.g., ORD-12345"
            }
          },
          required: ["order_id"]
        }
      },
      {
        type: "function",
        name: "check_calendar",
        description: "Check available slots in the calendar",
        parameters: {
          type: "object",
          properties: {
            date: {
              type: "string",
              description: "Date in YYYY-MM-DD format"
            }
          },
          required: ["date"]
        }
      }
    ],
    tool_choice: "auto", // the model decides when to call tools
  }
}));

// Handling tool calls from the server
ws.on("message", function incoming(message) {
  const event = JSON.parse(message.toString());

  if (event.type === "response.function_call_arguments.done") {
    const toolName = event.name;
    const args = JSON.parse(event.arguments);

    // Executing the tool call on your backend
    const result = await executeToolCall(toolName, args);

    // Returning the result to the session
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: event.call_id,
        output: JSON.stringify(result),
      }
    }));

    // Asking the model to continue the conversation with the result
    ws.send(JSON.stringify({
      type: "response.create",
    }));
  }
});

Recovery — Handling Session Errors

GPT-Realtime-2 handles errors natively and speaks them to the user. However, at the code level, you should also handle server-side error events:

ws.on("message", function incoming(message) {
  const event = JSON.parse(message.toString());

  if (event.type === "error") {
    console.error("Session error:", event.error.code, event.error.message);

    // If it's a rate limit error — wait and reconnect
    if (event.error.code === "rate_limit_exceeded") {
      setTimeout(() => reconnect(), 5000);
    }

    // If the session has expired — open a new one
    if (event.error.code === "session_expired") {
      openNewSession();
    }
  }
});

Reasoning effort and context window — how to balance quality, speed, and cost

Reasoning effort: five levels

GPT-Realtime-2 supports five levels of reasoning effort configurable via session.update:

ws.send(JSON.stringify({
  type: "session.update",
  session: {
    type: "realtime",
    model: "gpt-realtime-2",
    // Available values: "minimal", "low", "medium", "high", "xhigh"
    // Default: "low"
    reasoning_effort: "medium",
  }
}));

Practical level selection:

Level Latency Cost When to use
minimal ~300ms minimal Simple commands, navigation, yes/no answers
low (default) ~500ms low FAQ, order confirmations, standard support
medium ~800ms medium Multi-step scenarios, multiple tool calls
high ~1.5s high Complex agent flows, compliance-sensitive tasks
xhigh ~2-3s highest Maximum accuracy, critical decisions
Practical tip: OpenAI's marketing benchmarks (+15.2% Big Bench Audio, +13.8% Audio MultiChallenge) were measured at the high / xhigh level. With the default low, the results are different. Start with low, measure quality on real scenarios, and only increase effort where there's an objective problem with response quality.

Context window 128K — what it means in practice

128K tokens of context is approximately:

  • ~60–90 minutes of voice dialogue (depending on speech density)
  • A full session with 10–15 tool calls and their results
  • A detailed system prompt + all customer history + current conversation

Important: audio tokens cost significantly more than text tokens. With a 128K context, the total number of tokens (including audio chunks) must not exceed this limit. For very long sessions, compaction logic is needed — context compression:

// If the session is approaching the limit —
// you can stitch the context via the Responses API and start a new session
// with a compact summary of the previous conversation

// Check token usage in the response.done event:
if (event.type === "response.done") {
  const usage = event.response.usage;
  console.log("Tokens used:", usage.total_tokens);

  if (usage.total_tokens > 100000) {
    // Approaching the limit — plan for compaction
    scheduleContextCompaction();
  }
}

GPT-Realtime-Translate and GPT-Realtime-Whisper — connection and use cases

GPT-Realtime-Translate — live translation

Connects via the same Realtime API, but with a different session type. The official documentation recommends WebRTC for browser media and WebSocket for server-side media pipelines.

Key difference in configuration: for Translate, you don't need to call response.create — translation starts automatically as soon as there is audio.

// Session configuration for live translation
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    type: "translation", // Session type — translation
    model: "gpt-realtime-translate",
    translation: {
      source_language: "uk", // input speech language
      target_language: "en", // output language
    }
    // Do not call response.create — translation is automatic
  }
}));

Supported output languages (13): English, Spanish, French, German, Portuguese, Japanese, Korean, Chinese (Simplified), Arabic, Hindi, Italian, Dutch, Polish.

Input languages — 70+, including Ukrainian, Russian, Turkish, Hindi, Tamil, Telugu, and others.

GPT-Realtime-Whisper — streaming transcription

For transcription, a separate session type transcription is used:

// Session configuration for streaming transcription
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    type: "transcription", // Session type — transcription
    model: "gpt-realtime-whisper",
    transcription: {
      language: "uk", // transcription language (optional, auto-detect if not specified)
      // Latency adjustment:
      // low value = faster partial transcripts
      // high value = better quality
      delay: "low", // "low" | "medium" | "high"
    }
  }
}));

// Receiving transcripts
ws.on("message", function incoming(message) {
  const event = JSON.parse(message.toString());

  if (event.type === "conversation.item.input_audio_transcription.delta") {
    // Partial transcript — appears in real time
    process.stdout.write(event.delta);
  }

  if (event.type === "conversation.item.input_audio_transcription.completed") {
    // Final transcript of the turn
    console.log("\nFinal:", event.transcript);
  }
});
Whisper ≠ GPT-Realtime-Whisper: the classic whisper-1 via Speech-to-Text API is batch transcription of a finished audio file after recording. GPT-Realtime-Whisper is streaming, word by word, while the person is still speaking. Different endpoints, different model, different scenario. Do not substitute one for the other.

Common mistakes and how to avoid them

❌ Attempting to connect via OpenRouter or another aggregator

OpenRouter and similar aggregators are built on HTTP infrastructure. Realtime API requires WebSocket. Architectural incompatibility — cannot be resolved by settings. The only solution: a direct OpenAI key.

❌ Using API key on the client (browser) with WebSocket

When connecting via WebSocket from a browser, do not pass the main API key — it will be visible in the code. For browser clients, use WebRTC with an ephemeral token generated on your backend:

// On your backend — generate ephemeral token
const response = await fetch("https://api.openai.com/v1/realtime/client_secrets", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    session: {
      type: "realtime",
      model: "gpt-realtime-2",
      audio: { output: { voice: "marin" } },
    },
  }),
});
const data = await response.json();
const ephemeralKey = data.value; // pass to client

❌ Not handling reconnect on session_expired

Session maximum is 60 minutes. Without reconnect logic, your agent will simply freeze after an hour. Implement automatic reconnection with the transfer of a shortened context of the previous conversation.

❌ Setting xhigh effort for all requests

xhigh = maximum quality, but also maximum latency (~2-3s) and maximum output tokens. For an FAQ agent with thousands of calls per month, the cost difference between low and xhigh can be orders of magnitude. Start with low, measure, and increase pointwise.

❌ Ignoring audio format

Realtime API expects PCM16 at 24,000 Hz. If your microphone or phone system provides a different format, convert it before sending. Incorrect format will result in either silence or distorted speech without an explicit API error.

❌ Changing voice after the first audio response

The voice is fixed after the model's first audio response. Attempting to change voice via session.update afterwards is ignored without error. If you need to change the voice, open a new session.

FAQ

Do I need a separate API key for Realtime API?
No. If you already have a key for GPT-4o or GPT-5, it will work for Realtime API as well. The same key, the same account on platform.openai.com.

How to test GPT-Realtime-2 without code?
Through the OpenAI Playground — it has an interface for Realtime models with a microphone directly in the browser. The fastest way to hear the model before writing code.

Can I use GPT-Realtime-2 via OpenRouter?
No. OpenRouter is built on HTTP infrastructure, Realtime API requires WebSocket. Architectural incompatibility.

What is the maximum session duration?
60 minutes. After that, the server closes the connection. Implement reconnect logic for longer interactions.

Is Ukrainian language supported?
Yes — for GPT-Realtime-2 (understands and responds), as an input language for GPT-Realtime-Translate (70+ input languages include Ukrainian), and for GPT-Realtime-Whisper (transcription).

What is semantic_vad and when to use it?
Voice Activity Detection — a mechanism that determines the end of a user's phrase. semantic_vad uses context understanding to avoid cutting off sentences mid-way. Recommended for most scenarios — better than classic silence-based VAD which cuts sentences at natural pauses.

How are tokens counted for audio?
1 second of audio ≈ 40 tokens. At a cost of $32/1M input tokens, 1 minute of input audio costs approximately $0.077. Detailed cost calculation — in the official documentation.

Conclusions

GPT-Realtime-2 is a different type of integration compared to the Chat Completions API. Event-driven model, persistent WebSocket connection, Base64-encoded audio chunks — if you haven't worked with this stack before, factor in time for familiarization.

But architecturally, it all comes down to a few key things:

  • Choose the right method: WebSocket (server), WebRTC (browser), SIP (telephony)
  • Open connection → receive session.created → send session.update
  • Stream audio in chunks via input_audio_buffer.append
  • Respond to response.audio.delta and play audio
  • Handle tool calls and return results via conversation.item.create

Start with the Playground for initial testing, then take the WebSocket example from this article and adapt it to your stack. Reasoning effort — start with low.

Context about the release itself, real-world cases, and model comparisons — in the news article:

OpenAI released GPT-Realtime-2: the first voice model with GPT-5 level thinking

If you are interested in the broader OpenAI ecosystem for developers:

Codex from OpenAI: A Complete Guide 2026

Sources: OpenAI Realtime API Overview, WebSocket Guide, WebRTC Guide, SIP Guide, Managing Conversations, Managing Costs