If you plan to put OpenAI Realtime into production, do not let a passing demo fool you.

What usually breaks the system is not the model itself. It is short-lived auth tokens that never get rotated, missing interruption state, and zero end-to-end latency budgeting. Miss those three and your voice UX starts sounding like an angry walkie-talkie.

TL;DR: protect these four things first

  1. Use WebRTC for browser/mobile voice. Use WebSocket only for server-side bridging.
  2. Issue short-lived session tokens and rotate them before expiry.
  3. Treat user interruption as a real state machine event, not a UI hack.
  4. Split latency into capture, uplink, inference, downlink, and playback.

When to choose WebRTC and when to choose WebSocket

The practical rule is simple:

  • Direct user voice/audio: prefer WebRTC
  • Server-side transcription bridges, recording, orchestration: WebSocket can work
  • Need low-latency voice plus server audit/routing: keep WebRTC at the edge and send event summaries to the server instead of piping full audio around

Why:

  • WebRTC is built for real-time media transport
  • It handles NAT, jitter, and packet loss better than a hand-rolled WebSocket voice path in most real networks
  • WebSocket is fine for events, but a poor choice for carrying the full burden of last-mile voice UX

Auth rotation: never ship long-lived keys to clients

Wrong pattern:

  • Frontend holds a long-lived API key
  • No TTL on tokens
  • Reconnect keeps using an old token until a 401 blows up the session

Right pattern:

  1. Your Go service owns the real credential.
  2. The service issues a short-lived session token to the client.
  3. TTL stays around 1 to 5 minutes.
  4. Rotate 30 seconds before expiry; if rotation fails, reconnect gracefully.

Example Go handler for issuing short-lived sessions:

import (
    "bytes"
    "encoding/json"
    "io"
    "net/http"
    "os"
)

type RealtimeSessionRequest struct {
    UserID    string `json:"user_id"`
    DeviceID  string `json:"device_id"`
    Voice     string `json:"voice"`
    ExpiresIn int    `json:"expires_in"`
}

func issueRealtimeSession(w http.ResponseWriter, r *http.Request) {
    // mustUserID is a placeholder for however you resolve the
    // authenticated user from the request context.
    req := RealtimeSessionRequest{
        UserID:    mustUserID(r.Context()),
        DeviceID:  r.Header.Get("X-Device-ID"),
        Voice:     "alloy",
        ExpiresIn: 300,
    }

    body, err := json.Marshal(map[string]any{
        "model":      "gpt-4o-realtime-preview",
        "voice":      req.Voice,
        "expires_in": req.ExpiresIn,
    })
    if err != nil {
        http.Error(w, "encode request failed", http.StatusInternalServerError)
        return
    }

    upstreamReq, err := http.NewRequestWithContext(r.Context(), http.MethodPost,
        "https://api.openai.com/v1/realtime/sessions", bytes.NewReader(body))
    if err != nil {
        http.Error(w, "build request failed", http.StatusInternalServerError)
        return
    }
    upstreamReq.Header.Set("Authorization", "Bearer "+os.Getenv("OPENAI_API_KEY"))
    upstreamReq.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(upstreamReq)
    if err != nil {
        http.Error(w, "issue session failed", http.StatusBadGateway)
        return
    }
    defer resp.Body.Close()

    // Forward the upstream status code so the client can tell auth
    // failures apart from success instead of always seeing 200.
    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(resp.StatusCode)
    io.Copy(w, resp.Body)
}

Rotation rule:

func shouldRotate(expireAt time.Time, now time.Time) bool {
    return expireAt.Sub(now) <= 30*time.Second
}
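
A minimal rotation loop built on that rule, as a sketch: refreshSession is a hypothetical helper that re-calls the issuing endpoint above and returns the new expiry.

func rotateLoop(ctx context.Context, expireAt time.Time,
    refreshSession func(context.Context) (time.Time, error)) {
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case now := <-ticker.C:
            if !shouldRotate(expireAt, now) {
                continue
            }
            next, err := refreshSession(ctx)
            if err != nil {
                // Rotation failed: reconnect gracefully rather than
                // riding the old token into a 401.
                return
            }
            expireAt = next
        }
    }
}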

Interruption recovery: stop the response transaction, not just the speaker

The most common fake-stable setup in realtime voice is this: the frontend stops playback, but the backend/model side is still generating.

That leads to:

  • the user starts a second utterance while the previous response is still alive
  • transcript stitching goes out of order
  • the UI thinks it recovered while the model thinks the old turn is still active

Use an explicit state machine instead:

IDLE -> LISTENING -> THINKING -> SPEAKING
SPEAKING --interrupt--> CANCELLING -> LISTENING
THINKING --timeout/error--> RECOVERING -> LISTENING
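
One way to make those transitions enforceable is a small table. This sketch encodes exactly the edges drawn above; you will want to add your own (for example SPEAKING back to LISTENING when playback finishes normally):

// allowedTransitions encodes the diagram above; anything not listed
// is an illegal transition and should be logged, not applied.
var allowedTransitions = map[string][]string{
    "IDLE":       {"LISTENING"},
    "LISTENING":  {"THINKING"},
    "THINKING":   {"SPEAKING", "RECOVERING"},
    "SPEAKING":   {"CANCELLING"},
    "CANCELLING": {"LISTENING"},
    "RECOVERING": {"LISTENING"},
}

func canTransition(from, to string) bool {
    for _, next := range allowedTransitions[from] {
        if next == to {
            return true
        }
    }
    return false
}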

At minimum, persist:

  • session_id
  • conversation_id
  • response_id
  • last_user_audio_seq
  • last_ack_event_id

When the user interrupts, do three things:

  1. stop local playback
  2. send cancel / truncate upstream
  3. clear the current response_id and move back to LISTENING

Example:

type SessionState struct {
    SessionID        string
    ConversationID   string
    ResponseID       string
    Phase            string // IDLE / LISTENING / THINKING / SPEAKING / CANCELLING / RECOVERING
    LastUserAudioSeq int
    LastAckEventID   string
}

// interrupt covers steps 2 and 3 above; stopping local playback
// (step 1) happens on the client before this is called.
func interrupt(state *SessionState, send func(any) error) error {
    if state.ResponseID == "" {
        // Nothing in flight; just make sure we are back to LISTENING.
        state.Phase = "LISTENING"
        return nil
    }

    evt := map[string]any{
        "type":        "response.cancel",
        "response_id": state.ResponseID,
    }
    if err := send(evt); err != nil {
        // Cancel did not reach upstream; let the recovery path
        // reconcile instead of pretending the turn is over.
        state.Phase = "RECOVERING"
        return err
    }

    state.ResponseID = ""
    state.Phase = "LISTENING"
    return nil
}
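
The example above only sends response.cancel. If the user had already heard part of the answer, you can additionally truncate the assistant item at the playback position with the conversation.item.truncate event, so the server-side transcript matches what was actually played. itemID and playedMs are assumed to come from your playback layer:

// truncatePlayed cuts the assistant item at the point where local
// playback actually stopped, keeping the transcript aligned with
// what the user heard.
func truncatePlayed(send func(any) error, itemID string, playedMs int) error {
    return send(map[string]any{
        "type":          "conversation.item.truncate",
        "item_id":       itemID,
        "content_index": 0,
        "audio_end_ms":  playedMs,
    })
}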

End-to-end latency budgets: do not stare only at model time

My conservative advice: get to a stable 800ms to 1500ms perceived first response before chasing fantasy 300ms lab numbers.

Split the bill into five segments:

  1. Capture: local capture and VAD chunking
  2. Uplink: client-to-upstream transport
  3. Inference: model reasoning and event generation
  4. Downlink: event delivery back to the client
  5. Playback: player buffering and speech output

A reasonable conservative budget:

  • Capture: 80ms - 150ms
  • Uplink: 80ms - 150ms
  • Inference: 250ms - 700ms
  • Downlink: 50ms - 120ms
  • Playback: 80ms - 250ms

At minimum, instrument these histograms in Go:

// github.com/prometheus/client_golang/prometheus; metric names here are illustrative
func newHist(name string) prometheus.Histogram {
    return prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    name,
        Buckets: prometheus.ExponentialBuckets(10, 2, 10), // 10ms up to ~5s
    })
}

var (
    captureMs  = newHist("realtime_capture_ms")
    uplinkMs   = newHist("realtime_uplink_ms")
    inferMs    = newHist("realtime_infer_ms")
    downlinkMs = newHist("realtime_downlink_ms")
    playbackMs = newHist("realtime_playback_ms")
    e2eMs      = newHist("realtime_e2e_ms")
)

Suggested reporting flow:

start := time.Now()
// capture done
captureMs.Observe(float64(time.Since(start).Milliseconds()))

uplinkStart := time.Now()
// audio chunk handed to the transport; measuring true uplink
// needs server-side receive timestamps or an RTT/2 estimate
uplinkMs.Observe(float64(time.Since(uplinkStart).Milliseconds()))

inferStart := time.Now()
// first model event arrived; measured client-side this also
// contains downlink, so treat it as an upper bound
inferMs.Observe(float64(time.Since(inferStart).Milliseconds()))
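
Downlink and playback follow the same pattern. Measure end-to-end separately, from capture start to first audible output, instead of summing the segments; the segments exist to locate the bottleneck, the e2e number is what the user feels:

downlinkStart := time.Now()
// first audio delta delivered to the player
downlinkMs.Observe(float64(time.Since(downlinkStart).Milliseconds()))

playbackStart := time.Now()
// first audible sample out of the speaker
playbackMs.Observe(float64(time.Since(playbackStart).Milliseconds()))

// end to end: capture start until first audible output
e2eMs.Observe(float64(time.Since(start).Milliseconds()))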

WebRTC recovery: split reconnects into two classes

Do not treat every disconnect as the same problem. At least separate these:

1) transport-layer blips

Symptoms:

  • ICE failure
  • network switch (Wi‑Fi -> 4G)
  • temporary packet loss causing media interruption

Action:

  • rebuild the PeerConnection first
  • try to reuse the session context
  • check whether the token is near expiry before renegotiation

2) session-layer failures

Symptoms:

  • expired token
  • upstream 401/403
  • broken response/event ordering

Action:

  • issue a new session token
  • create a new session
  • restore only a short, necessary summary instead of replaying all historical audio (see the sketch below)
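
For that last point, the restore can be one synthetic item carrying a recap instead of historical audio. A sketch using the conversation.item.create event; how you condense your stored event log into the summary string is up to you:

// replaySummary seeds a fresh session with a short recap instead of
// re-sending old audio, keeping session growth bounded.
func replaySummary(send func(any) error, summary string) error {
    return send(map[string]any{
        "type": "conversation.item.create",
        "item": map[string]any{
            "type": "message",
            "role": "user",
            "content": []map[string]any{
                {"type": "input_text", "text": "Context from the previous session: " + summary},
            },
        },
    })
}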

Recovery pseudo code:

// RecoveryPlan describes what a reconnect needs to rebuild.
type RecoveryPlan struct {
    RebuildPeer   bool // tear down and recreate the PeerConnection
    ReissueToken  bool // fetch a fresh short-lived session token first
    ReplaySummary bool // seed the new session with a short context summary
}

func recoverSession(reason string) RecoveryPlan {
    switch reason {
    case "ice_failed", "network_switch":
        // Transport blip: keep the session, but still run
        // shouldRotate before renegotiating.
        return RecoveryPlan{RebuildPeer: true, ReissueToken: false, ReplaySummary: true}
    case "token_expired", "auth_failed":
        return RecoveryPlan{RebuildPeer: true, ReissueToken: true, ReplaySummary: true}
    default:
        return RecoveryPlan{RebuildPeer: true, ReissueToken: true, ReplaySummary: false}
    }
}
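
To wire the classifier into the transport layer, here is a sketch assuming a Go-side media bridge built on pion/webrtc (a browser client would hook the equivalent RTCPeerConnection state events):

import "github.com/pion/webrtc/v3"

// watchICE maps transport-level state changes onto the recovery
// classifier above. pc is an established *webrtc.PeerConnection.
func watchICE(pc *webrtc.PeerConnection,
    classify func(reason string) RecoveryPlan, apply func(RecoveryPlan)) {
    pc.OnICEConnectionStateChange(func(s webrtc.ICEConnectionState) {
        switch s {
        case webrtc.ICEConnectionStateDisconnected:
            // Often transient (e.g. Wi-Fi to 4G); give ICE a moment
            // to recover on its own before tearing anything down.
        case webrtc.ICEConnectionStateFailed:
            apply(classify("ice_failed"))
        }
    })
}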

A production baseline that is actually enough

Before launch, add at least these six guardrails:

  1. short-lived token + rotation
  2. explicit interruption state machine
  3. first-byte and full-turn latency metrics
  4. classified alerts for 401 and ICE failures
  5. auto-recovery tests after client network switching
  6. server-side summary replay instead of unbounded session growth

Recommended alerts:

  • realtime_first_audio_p95 > 1500ms
  • session_rotate_fail_total > 0
  • interrupt_recovery_fail_rate > 5%
  • sharp increase in ice_restart_total within 15 minutes

Troubleshooting order: do not blame the model first

When something breaks, check in this order:

  1. Auth: is the client holding an expired token?
  2. Connectivity: did ICE / NAT / network switching break media flow?
  3. State machine: is a stale response_id left behind after interruption?
  4. Budget: is the bottleneck capture, transport, inference, or playback?
  5. Replay policy: are you shoving too much old context back during recovery?

Minimum viable production setup

If your goal is stability first, not showmanship, start here:

  • Use WebRTC for the edge voice path
  • Let the Go service only issue short-lived tokens and store event summaries
  • Set token TTL to 300 seconds and rotate 30 seconds early
  • Force response.cancel on interruption
  • Get first response under 1.5 seconds before chasing fancier optimizations

Summary

The hard part of OpenAI Realtime is not “connecting voice.” It is maintaining a stable experience under flaky networks, interruptible interaction, and short-lived credentials.

In production, the best order of work is simple: lock down auth first, fix interruption state second, then budget latency. Do those three and the system starts acting like a product instead of a demo.