If you plan to put OpenAI Realtime into production, do not let a passing demo fool you.
What usually breaks the system is not the model itself. It is short-lived auth that never rotates, missing interruption state, and the lack of an end-to-end latency budget. Miss those three and your voice UX starts sounding like an angry walkie-talkie.
TL;DR: protect these four things first
- Use WebRTC for browser/mobile voice. Use WebSocket only for server-side bridging.
- Issue short-lived session tokens and rotate them before expiry.
- Treat user interruption as a real state machine event, not a UI hack.
- Split latency into capture, uplink, inference, downlink, and playback.
When to choose WebRTC and when to choose WebSocket
The practical rule is simple:
- Direct user voice/audio: prefer WebRTC
- Server-side transcription bridges, recording, orchestration: WebSocket can work
- Need low-latency voice plus server audit/routing: keep WebRTC at the edge and send event summaries to the server instead of piping full audio around (sketched below)
Why:
- WebRTC is built for real-time media transport
- It handles NAT, jitter, and packet loss better than a hand-rolled WebSocket voice path in most real networks
- WebSocket is fine for events, but a poor choice for carrying the full burden of last-mile voice UX
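To make the event-summary idea concrete: instead of proxying audio, the edge client can post a compact per-turn summary to the Go service. This is only a sketch with illustrative field names, built from the identifiers this post asks you to persist later:

type TurnSummary struct {
	SessionID       string `json:"session_id"`
	ConversationID  string `json:"conversation_id"`
	ResponseID      string `json:"response_id"`
	UserTranscript  string `json:"user_transcript"`
	ModelTranscript string `json:"model_transcript"`
	Interrupted     bool   `json:"interrupted"`
	FirstAudioMs    int    `json:"first_audio_ms"`
}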
Auth rotation: never ship long-lived keys to clients
Wrong pattern:
- Frontend holds a long-lived API key
- No TTL on tokens
- Reconnect keeps using an old token until a 401 blows up the session
Right pattern:
- Your Go service owns the real credential.
- The service issues a short-lived session token to the client.
- TTL stays around 1 to 5 minutes.
- Rotate 30 seconds before expiry; if rotation fails, reconnect gracefully.
Example Go handler for issuing short-lived sessions:
import (
	"bytes"
	"encoding/json"
	"io"
	"net/http"
	"os"
)

type RealtimeSessionRequest struct {
	UserID    string `json:"user_id"`
	DeviceID  string `json:"device_id"`
	Voice     string `json:"voice"`
	ExpiresIn int    `json:"expires_in"`
}

// issueRealtimeSession mints a short-lived Realtime session for the caller.
// The real API key never leaves this service.
func issueRealtimeSession(w http.ResponseWriter, r *http.Request) {
	req := RealtimeSessionRequest{
		UserID:    mustUserID(r.Context()), // resolved by your auth middleware
		DeviceID:  r.Header.Get("X-Device-ID"), // captured for audit; not forwarded upstream
		Voice:     "alloy",
		ExpiresIn: 300, // 5-minute TTL
	}

	// Static map of strings and ints: marshal cannot fail here.
	body, _ := json.Marshal(map[string]any{
		"model":      "gpt-4o-realtime-preview",
		"voice":      req.Voice,
		"expires_in": req.ExpiresIn,
	})

	upstreamReq, err := http.NewRequestWithContext(r.Context(), http.MethodPost,
		"https://api.openai.com/v1/realtime/sessions", bytes.NewReader(body))
	if err != nil {
		http.Error(w, "build upstream request failed", http.StatusInternalServerError)
		return
	}
	upstreamReq.Header.Set("Authorization", "Bearer "+os.Getenv("OPENAI_API_KEY"))
	upstreamReq.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(upstreamReq)
	if err != nil {
		http.Error(w, "issue session failed", http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	// Pass the upstream status and body through so the client also sees failures.
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(resp.StatusCode)
	io.Copy(w, resp.Body)
}
Rotation rule:
func shouldRotate(expireAt time.Time, now time.Time) bool {
	return expireAt.Sub(now) <= 30*time.Second
}
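A minimal sketch of how a connection manager could drive that check, assuming a hypothetical refreshSession callback that calls the token-issuing handler above and returns the new expiry:

func rotateLoop(ctx context.Context, expireAt time.Time,
	refreshSession func(context.Context) (time.Time, error)) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case now := <-ticker.C:
			if !shouldRotate(expireAt, now) {
				continue
			}
			next, err := refreshSession(ctx)
			if err != nil {
				// Rotation failed: trigger a graceful reconnect instead of
				// riding the old token into a 401.
				return
			}
			expireAt = next
		}
	}
}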
Interruption recovery: stop the response transaction, not just the speaker
The most common fake-stable setup in realtime voice is this: the frontend stops playback, but the backend/model side is still generating.
That leads to:
- the user starts a second utterance while the previous response is still alive
- transcript stitching goes out of order
- the UI thinks it recovered while the model thinks the old turn is still active
Use an explicit state machine instead:
IDLE -> LISTENING -> THINKING -> SPEAKING
SPEAKING --interrupt--> CANCELLING -> LISTENING
THINKING --timeout/error--> RECOVERING -> LISTENING
At minimum, persist:
- session_id
- conversation_id
- response_id
- last_user_audio_seq
- last_ack_event_id
When the user interrupts, do three things:
- stop local playback
- send cancel / truncate upstream
- clear the current response_id and move back to LISTENING
Example:
type SessionState struct {
	SessionID      string
	ConversationID string
	ResponseID     string // response currently being generated; "" when none
	Phase          string // IDLE / LISTENING / THINKING / SPEAKING / CANCELLING / RECOVERING
}

func interrupt(state *SessionState, send func(any) error) error {
	// Nothing in flight: just make sure we are back to listening.
	if state.ResponseID == "" {
		state.Phase = "LISTENING"
		return nil
	}

	// Ask upstream to stop generating the current response.
	evt := map[string]any{
		"type":        "response.cancel",
		"response_id": state.ResponseID,
	}
	if err := send(evt); err != nil {
		// The cancel never made it out; let the recovery path take over.
		state.Phase = "RECOVERING"
		return err
	}

	// Forget the cancelled response and accept the next utterance.
	state.ResponseID = ""
	state.Phase = "LISTENING"
	return nil
}
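As a usage sketch, a client bridge's event loop could call interrupt the moment voice activity is detected during playback; vadFired, sendEvent, and the log call are placeholders for your own plumbing:

// User started talking over the assistant: cancel before accepting new audio.
if state.Phase == "SPEAKING" && vadFired {
	if err := interrupt(state, sendEvent); err != nil {
		log.Printf("interrupt failed, entering recovery: %v", err)
	}
}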
End-to-end latency budgets: do not stare only at model time
My conservative take: I would rather get you to a stable 800ms to 1500ms perceived first response than have you chase fantasy 300ms lab numbers.
Split the bill into five segments:
- Capture: local capture and VAD chunking
- Uplink: client-to-upstream transport
- Inference: model reasoning and event generation
- Downlink: event delivery back to the client
- Playback: player buffering and speech output
A reasonable conservative budget:
- Capture: 80ms - 150ms
- Uplink: 80ms - 150ms
- Inference: 250ms - 700ms
- Downlink: 50ms - 120ms
- Playback: 80ms - 250ms
At minimum, instrument these histograms in Go:
var (
	captureMs  = prometheus.NewHistogram(...)
	uplinkMs   = prometheus.NewHistogram(...)
	inferMs    = prometheus.NewHistogram(...)
	downlinkMs = prometheus.NewHistogram(...)
	playbackMs = prometheus.NewHistogram(...)
	e2eMs      = prometheus.NewHistogram(...)
)
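The ... stands in for whatever histogram options you normally use. Written out in full, the inferMs declaration could look like the following; the metric name and buckets are illustrative, chosen to bracket the budget above, and assume the standard Prometheus Go client (github.com/prometheus/client_golang/prometheus):

var inferMs = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "realtime_inference_ms",
	Help:    "Time from last audio chunk sent to first model event received.",
	Buckets: []float64{100, 250, 400, 700, 1000, 1500, 2500},
})

func init() {
	// Register once at startup so /metrics exposes the series.
	prometheus.MustRegister(captureMs, uplinkMs, inferMs, downlinkMs, playbackMs, e2eMs)
}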
Suggested reporting flow:
start := time.Now()
// capture done
captureMs.Observe(float64(time.Since(start).Milliseconds()))
uplinkStart := time.Now()
// send audio chunk
uplinkMs.Observe(float64(time.Since(uplinkStart).Milliseconds()))
inferStart := time.Now()
// first model event arrived
inferMs.Observe(float64(time.Since(inferStart).Milliseconds()))
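The downlink and playback legs are measured on the client and reported back; continuing the same sketch with illustrative boundary events:

downlinkStart := time.Now()
// first audio delta arrived at the client
downlinkMs.Observe(float64(time.Since(downlinkStart).Milliseconds()))

playbackStart := time.Now()
// first audible sample left the speaker
playbackMs.Observe(float64(time.Since(playbackStart).Milliseconds()))

// perceived first response: capture start to first audible output
e2eMs.Observe(float64(time.Since(start).Milliseconds()))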
WebRTC recovery: split reconnects into two classes
Do not treat every disconnect as the same problem. At least separate these:
1) transport-layer blips
Symptoms:
- ICE failure
- network switch (Wi‑Fi -> 4G)
- temporary packet loss causing media interruption
Action:
- rebuild the PeerConnection first
- try to reuse the session context
- check whether the token is near expiry before renegotiation
2) session-layer failures
Symptoms:
- expired token
- upstream 401/403
- broken response/event ordering
Action:
- issue a new session token
- create a new session
- restore only a short necessary summary instead of replaying all historical audio
Recovery pseudo code:
type RecoveryPlan struct {
	RebuildPeer   bool // tear down and recreate the PeerConnection
	ReissueToken  bool // mint a fresh short-lived session token first
	ReplaySummary bool // restore a short server-side summary, never raw audio
}

func recoverSession(reason string) RecoveryPlan {
	switch reason {
	case "ice_failed", "network_switch":
		// Transport blip: rebuild media, keep the existing token, replay a short summary.
		return RecoveryPlan{RebuildPeer: true, ReissueToken: false, ReplaySummary: true}
	case "token_expired", "auth_failed":
		// Session failure: new token and session, then restore the summary.
		return RecoveryPlan{RebuildPeer: true, ReissueToken: true, ReplaySummary: true}
	default:
		return RecoveryPlan{RebuildPeer: true, ReissueToken: true, ReplaySummary: false}
	}
}
A production baseline that is actually enough
Before launch, add at least these six guardrails:
- short-lived token + rotation
- explicit interruption state machine
- first-byte and full-turn latency metrics
- classified alerts for 401 and ICE failures
- auto-recovery tests after client network switching
- server-side summary replay instead of unbounded session growth
Recommended alerts:
- realtime_first_audio_p95 > 1500ms
- session_rotate_fail_total > 0
- interrupt_recovery_fail_rate > 5%
- sharp increase in ice_restart_total within 15 minutes
Troubleshooting order: do not blame the model first
When something breaks, check in this order:
- Auth: is the client holding an expired token?
- Connectivity: did ICE / NAT / network switching break media flow?
- State machine: is a stale response_id left behind after interruption?
- Budget: is the bottleneck capture, transport, inference, or playback?
- Replay policy: are you shoving too much old context back during recovery?
Minimum viable production setup
If your goal is stability first, not showmanship, start here:
- Use WebRTC for the edge voice path
- Let the Go service only issue short-lived tokens and store event summaries
- Set token TTL to 300 seconds and rotate 30 seconds early
- Force response.cancel on interruption
- Get first response under 1.5 seconds before chasing fancier optimizations
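To make those defaults explicit in code, here is a small config sketch carrying the numbers from this post; the names are illustrative, not from any SDK:

type RealtimeConfig struct {
	TokenTTL          time.Duration // how long an issued session token lives
	RotateBefore      time.Duration // rotate this long before expiry
	FirstResponseSLO  time.Duration // perceived first response target
	CancelOnInterrupt bool          // always send response.cancel on barge-in
}

var defaultConfig = RealtimeConfig{
	TokenTTL:          300 * time.Second,
	RotateBefore:      30 * time.Second,
	FirstResponseSLO:  1500 * time.Millisecond,
	CancelOnInterrupt: true,
}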
Summary
The hard part of OpenAI Realtime is not “connecting voice.” It is maintaining a stable experience under flaky networks, interruptible interaction, and short-lived credentials.
In production, the best order of work is simple: lock down auth first, fix interruption state second, then budget latency. Do those three and the system starts acting like a product instead of a demo.