If you plan to put OpenAI Realtime into production, do not let a passing demo fool you.
What usually breaks the system is not the model itself. It is short-lived auth that never rotates, missing interruption state, and the lack of an end-to-end latency budget. Miss those three and your voice UX starts sounding like an angry walkie-talkie.
TL;DR: protect these four things first
- Use WebRTC for browser/mobile voice. Use WebSocket only for server-side bridging.
- Issue short-lived session tokens and rotate them before expiry.
- Treat user interruption as a real state machine event, not a UI hack.
- Split latency into capture, uplink, inference, downlink, and playback.
When to choose WebRTC and when to choose WebSocket
The practical rule is simple:
- Direct user voice/audio: prefer WebRTC
- Server-side transcription bridges, recording, orchestration: WebSocket can work
- Need low-latency voice plus server audit/routing: keep WebRTC at the edge and send event summaries to the server instead of piping full audio around (sketched below)
Why:
- WebRTC is built for real-time media transport
- It handles NAT, jitter, and packet loss better than a hand-rolled WebSocket voice path in most real networks
- WebSocket is fine for events, but a poor choice for carrying the full burden of last-mile voice UX
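To make the event-summary idea concrete: instead of proxying audio, the edge client can post a compact per-turn summary to the Go service. This is only a sketch with illustrative field names, built from the identifiers this post asks you to persist later:

type TurnSummary struct {
	SessionID       string `json:"session_id"`
	ConversationID  string `json:"conversation_id"`
	ResponseID      string `json:"response_id"`
	UserTranscript  string `json:"user_transcript"`
	ModelTranscript string `json:"model_transcript"`
	Interrupted     bool   `json:"interrupted"`
	FirstAudioMs    int    `json:"first_audio_ms"`
}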
Auth rotation: never ship long-lived keys to clients
Wrong pattern:
- Frontend holds a long-lived API key
- No TTL on tokens
- Reconnect keeps using an old token until a 401 blows up the session
Right pattern:
- Your Go service owns the real credential.
- The service issues a short-lived session token to the client.
- TTL stays around 1 to 5 minutes.
- Rotate 30 seconds before expiry; if rotation fails, reconnect gracefully.
Example Go handler for issuing short-lived sessions:
import (
	"bytes"
	"encoding/json"
	"io"
	"net/http"
	"os"
)

type RealtimeSessionRequest struct {
	UserID    string `json:"user_id"`
	DeviceID  string `json:"device_id"`
	Voice     string `json:"voice"`
	ExpiresIn int    `json:"expires_in"`
}

// issueRealtimeSession mints a short-lived Realtime session for the caller.
// The real API key never leaves this service.
func issueRealtimeSession(w http.ResponseWriter, r *http.Request) {
	req := RealtimeSessionRequest{
		UserID:    mustUserID(r.Context()), // resolved by your auth middleware
		DeviceID:  r.Header.Get("X-Device-ID"), // captured for audit; not forwarded upstream
		Voice:     "alloy",
		ExpiresIn: 300, // 5-minute TTL
	}

	// Static map of strings and ints: marshal cannot fail here.
	body, _ := json.Marshal(map[string]any{
		"model":      "gpt-4o-realtime-preview",
		"voice":      req.Voice,
		"expires_in": req.ExpiresIn,
	})

	upstreamReq, err := http.NewRequestWithContext(r.Context(), http.MethodPost,
		"https://api.openai.com/v1/realtime/sessions", bytes.NewReader(body))
	if err != nil {
		http.Error(w, "build upstream request failed", http.StatusInternalServerError)
		return
	}
	upstreamReq.Header.Set("Authorization", "Bearer "+os.Getenv("OPENAI_API_KEY"))
	upstreamReq.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(upstreamReq)
	if err != nil {
		http.Error(w, "issue session failed", http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	// Pass the upstream status and body through so the client also sees failures.
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(resp.StatusCode)
	io.Copy(w, resp.Body)
}
Rotation rule:
func shouldRotate(expireAt time.Time, now time.Time) bool {
	return expireAt.Sub(now) <= 30*time.Second
}
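A minimal sketch of how a connection manager could drive that check, assuming a hypothetical refreshSession callback that calls the token-issuing handler above and returns the new expiry:

func rotateLoop(ctx context.Context, expireAt time.Time,
	refreshSession func(context.Context) (time.Time, error)) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case now := <-ticker.C:
			if !shouldRotate(expireAt, now) {
				continue
			}
			next, err := refreshSession(ctx)
			if err != nil {
				// Rotation failed: trigger a graceful reconnect instead of
				// riding the old token into a 401.
				return
			}
			expireAt = next
		}
	}
}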
Interruption recovery: stop the response transaction, not just the speaker
The most common fake-stable setup in realtime voice is this: the frontend stops playback, but the backend/model side is still generating.
That leads to:
- the user starts a second utterance while the previous response is still alive
- transcript stitching goes out of order
- the UI thinks it recovered while the model thinks the old turn is still active
Use an explicit state machine instead:
IDLE -> LISTENING -> THINKING -> SPEAKING
SPEAKING --interrupt--> CANCELLING -> LISTENING
THINKING --timeout/error--> RECOVERING -> LISTENING
At minimum, persist:
- session_id
- conversation_id
- response_id
- last_user_audio_seq
- last_ack_event_id
When the user interrupts, do three things:
- stop local playback
- send cancel / truncate upstream
- clear the current response_id and move back to LISTENING
Example:
type SessionState struct {
	SessionID      string
	ConversationID string
	ResponseID     string // response currently being generated; "" when none
	Phase          string // IDLE / LISTENING / THINKING / SPEAKING / CANCELLING / RECOVERING
}

func interrupt(state *SessionState, send func(any) error) error {
	// Nothing in flight: just make sure we are back to listening.
	if state.ResponseID == "" {
		state.Phase = "LISTENING"
		return nil
	}

	// Ask upstream to stop generating the current response.
	evt := map[string]any{
		"type":        "response.cancel",
		"response_id": state.ResponseID,
	}
	if err := send(evt); err != nil {
		// The cancel never made it out; let the recovery path take over.
		state.Phase = "RECOVERING"
		return err
	}

	// Forget the cancelled response and accept the next utterance.
	state.ResponseID = ""
	state.Phase = "LISTENING"
	return nil
}
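As a usage sketch, a client bridge's event loop could call interrupt the moment voice activity is detected during playback; vadFired, sendEvent, and the log call are placeholders for your own plumbing:

// User started talking over the assistant: cancel before accepting new audio.
if state.Phase == "SPEAKING" && vadFired {
	if err := interrupt(state, sendEvent); err != nil {
		log.Printf("interrupt failed, entering recovery: %v", err)
	}
}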
End-to-end latency budgets: do not stare only at model time
My conservative take: I would rather get you to a stable 800ms to 1500ms perceived first response than have you chase fantasy 300ms lab numbers.
Split the bill into five segments:
- Capture: local capture and VAD chunking
- Uplink: client-to-upstream transport
- Inference: model reasoning and event generation
- Downlink: event delivery back to the client
- Playback: player buffering and speech output
A reasonable conservative budget:
- Capture: 80ms - 150ms
- Uplink: 80ms - 150ms
- Inference: 250ms - 700ms
- Downlink: 50ms - 120ms
- Playback: 80ms - 250ms
At minimum, instrument these histograms in Go:
var (
	captureMs  = prometheus.NewHistogram(...)
	uplinkMs   = prometheus.NewHistogram(...)
	inferMs    = prometheus.NewHistogram(...)
	downlinkMs = prometheus.NewHistogram(...)
	playbackMs = prometheus.NewHistogram(...)
	e2eMs      = prometheus.NewHistogram(...)
)
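The ... stands in for whatever histogram options you normally use. Written out in full, the inferMs declaration could look like the following; the metric name and buckets are illustrative, chosen to bracket the budget above, and assume the standard Prometheus Go client (github.com/prometheus/client_golang/prometheus):

var inferMs = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "realtime_inference_ms",
	Help:    "Time from last audio chunk sent to first model event received.",
	Buckets: []float64{100, 250, 400, 700, 1000, 1500, 2500},
})

func init() {
	// Register once at startup so /metrics exposes the series.
	prometheus.MustRegister(captureMs, uplinkMs, inferMs, downlinkMs, playbackMs, e2eMs)
}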
Suggested reporting flow:
start := time.Now()
// capture done
captureMs.Observe(float64(time.Since(start).Milliseconds()))
uplinkStart := time.Now()
// send audio chunk
uplinkMs.Observe(float64(time.Since(uplinkStart).Milliseconds()))
inferStart := time.Now()
// first model event arrived
inferMs.Observe(float64(time.Since(inferStart).Milliseconds()))
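The downlink and playback legs are measured on the client and reported back; continuing the same sketch with illustrative boundary events:

downlinkStart := time.Now()
// first audio delta arrived at the client
downlinkMs.Observe(float64(time.Since(downlinkStart).Milliseconds()))

playbackStart := time.Now()
// first audible sample left the speaker
playbackMs.Observe(float64(time.Since(playbackStart).Milliseconds()))

// perceived first response: capture start to first audible output
e2eMs.Observe(float64(time.Since(start).Milliseconds()))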
WebRTC recovery: split reconnects into two classes
Do not treat every disconnect as the same problem. At least separate these:
1) transport-layer blips
Symptoms:
- ICE failure
- network switch (Wi‑Fi -> 4G)
- temporary packet loss causing media interruption
Action:
- rebuild the PeerConnection first
- try to reuse the session context
- check whether the token is near expiry before renegotiation
2) session-layer failures
Symptoms:
- expired token
- upstream 401/403
- broken response/event ordering
Action:
- issue a new session token
- create a new session
- restore only a short necessary summary instead of replaying all historical audio
Recovery pseudo code:
type RecoveryPlan struct {
	RebuildPeer   bool // tear down and recreate the PeerConnection
	ReissueToken  bool // mint a fresh short-lived session token first
	ReplaySummary bool // restore a short server-side summary, never raw audio
}

func recoverSession(reason string) RecoveryPlan {
	switch reason {
	case "ice_failed", "network_switch":
		// Transport blip: rebuild media, keep the existing token, replay a short summary.
		return RecoveryPlan{RebuildPeer: true, ReissueToken: false, ReplaySummary: true}
	case "token_expired", "auth_failed":
		// Session failure: new token and session, then restore the summary.
		return RecoveryPlan{RebuildPeer: true, ReissueToken: true, ReplaySummary: true}
	default:
		return RecoveryPlan{RebuildPeer: true, ReissueToken: true, ReplaySummary: false}
	}
}
A production baseline that is actually enough
Before launch, add at least these six guardrails:
- short-lived token + rotation
- explicit interruption state machine
- first-byte and full-turn latency metrics
- classified alerts for 401 and ICE failures
- auto-recovery tests after client network switching
- server-side summary replay instead of unbounded session growth
Recommended alerts:
- realtime_first_audio_p95 > 1500ms
- session_rotate_fail_total > 0
- interrupt_recovery_fail_rate > 5%
- sharp increase in ice_restart_total within 15 minutes
Troubleshooting order: do not blame the model first
When something breaks, check in this order:
- Auth: is the client holding an expired token?
- Connectivity: did ICE / NAT / network switching break media flow?
- State machine: is a stale response_id left behind after interruption?
- Budget: is the bottleneck capture, transport, inference, or playback?
- Replay policy: are you shoving too much old context back during recovery?
Minimum viable production setup
If your goal is stability first, not showmanship, start here:
- Use WebRTC for the edge voice path
- Let the Go service only issue short-lived tokens and store event summaries
- Set token TTL to 300 seconds and rotate 30 seconds early
- Force response.cancel on interruption
- Get first response under 1.5 seconds before chasing fancier optimizations
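To make those defaults explicit in code, here is a small config sketch carrying the numbers from this post; the names are illustrative, not from any SDK:

type RealtimeConfig struct {
	TokenTTL          time.Duration // how long an issued session token lives
	RotateBefore      time.Duration // rotate this long before expiry
	FirstResponseSLO  time.Duration // perceived first response target
	CancelOnInterrupt bool          // always send response.cancel on barge-in
}

var defaultConfig = RealtimeConfig{
	TokenTTL:          300 * time.Second,
	RotateBefore:      30 * time.Second,
	FirstResponseSLO:  1500 * time.Millisecond,
	CancelOnInterrupt: true,
}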
Summary
The hard part of OpenAI Realtime is not “connecting voice.” It is maintaining a stable experience under flaky networks, interruptible interaction, and short-lived credentials.
In production, the best order of work is simple: lock down auth first, fix interruption state second, then budget latency. Do those three and the system starts acting like a product instead of a demo.