Production streaming fails in two predictable ways: users wait while the stream silently drops, and your logs say “timeout” without telling you where it actually broke.

This guide gives you a practical Go pattern for OpenAI Responses API streaming with strict timeout boundaries, safe retries, and useful telemetry.

1) Define retry boundaries first

Do not retry every error.

  • Retryable: 429, 5xx, transient network resets, upstream gateway timeout
  • Non-retryable: 401/403, invalid request payload, explicit context cancellation
  • Conditional: mid-stream disconnects (depends on whether continuation is acceptable)

Retrying 4xx blindly is just automated self-harm.
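
The retry skeleton in section 5 leans on an isRetryable helper. A minimal sketch of that classifier, where apiError and its StatusCode field stand in for whatever typed error your OpenAI client actually returns:

func isRetryable(err error) bool {
    // Our own cancellation or deadline: retrying inside a dead context is pointless.
    if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
        return false
    }
    // apiError is a placeholder for your client library's typed error.
    var apiErr *apiError
    if errors.As(err, &apiErr) {
        return apiErr.StatusCode == 429 || apiErr.StatusCode >= 500
    }
    // Network timeouts are worth another attempt; extend this to connection
    // resets if you want those retried as well.
    var netErr net.Error
    if errors.As(err, &netErr) && netErr.Timeout() {
        return true
    }
    return false
}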

2) Layer your timeouts

Do not rely on a single timeout. Use three layers:

  1. Request context timeout (hard deadline, e.g. 45s)
  2. HTTP client timeout (connection guard, e.g. 50s)
  3. Stream idle timeout (abort when no token arrives for too long)

// Layer 1: hard deadline for the whole streaming request.
ctx, cancel := context.WithTimeout(r.Context(), 45*time.Second)
defer cancel()

// Layer 2: client-level guard that backstops the context deadline.
httpClient := &http.Client{Timeout: 50 * time.Second}

Keep the HTTP client timeout slightly larger than the context timeout so the context deadline always fires first and failures surface as a context error at the call site rather than an opaque client timeout.
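
The first two layers are one-liners; the idle timeout is not, because neither the context nor the HTTP client can tell that no token has arrived for a while. A minimal sketch, assuming the stream is delivered as a channel of events (the Event type and events channel are illustrative, not a specific SDK API):

func consumeWithIdleTimeout(ctx context.Context, events <-chan Event, idle time.Duration, handle func(Event)) error {
    timer := time.NewTimer(idle)
    defer timer.Stop()

    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-timer.C:
            return errors.New("stream idle timeout: no event received")
        case ev, ok := <-events:
            if !ok {
                return nil // upstream closed the stream normally
            }
            handle(ev)
            // Reset the idle clock on every event actually received.
            if !timer.Stop() {
                select {
                case <-timer.C:
                default:
                }
            }
            timer.Reset(idle)
        }
    }
}

An idle timeout of 10-20s is usually much tighter than the 45s hard deadline, so a stalled stream fails fast instead of burning the whole request budget.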

3) Exponential backoff with jitter

Practical defaults:

  • Initial backoff: 300-500ms
  • Exponential growth with cap (e.g. 5s)
  • Add jitter to avoid synchronized retry spikes
  • Max attempts: 3-5

// backoff returns the wait before the given retry attempt (0-indexed):
// 400ms doubled per attempt, capped at 5s, plus up to 25% jitter.
func backoff(attempt int) time.Duration {
    base := 400 * time.Millisecond
    maxDelay := 5 * time.Second
    d := base * time.Duration(1<<attempt)
    if d > maxDelay {
        d = maxDelay
    }
    jitter := time.Duration(rand.Int63n(int64(d / 4)))
    return d + jitter
}
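
With these defaults the pre-jitter schedule is 400ms, 800ms, 1.6s, 3.2s, then the 5s cap. A quick way to eyeball it (exact values vary because of the jitter):

func main() {
    for attempt := 0; attempt < 5; attempt++ {
        fmt.Printf("attempt %d: wait %v\n", attempt, backoff(attempt))
    }
}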

4) Minimum observability signals

At least emit these metrics:

  • llm_stream_first_token_ms
  • llm_stream_total_duration_ms
  • llm_stream_tokens_in
  • llm_stream_tokens_out
  • llm_stream_retry_count
  • llm_stream_error_total{code,type}

Attach trace_id per request so you can correlate request entry, model latency, and stream completion.
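
A lightweight way to carry these signals is one record per request, updated as tokens arrive and flushed when the stream ends. Here emit is a placeholder for whatever metrics client you use (Prometheus, OTel, statsd):

// emit(name, value, traceID) is a placeholder for your metrics backend.
type streamStats struct {
    TraceID    string
    Start      time.Time
    FirstToken time.Time // zero until the first delta arrives
    TokensOut  int
    Retries    int
}

func (s *streamStats) onToken() {
    if s.FirstToken.IsZero() {
        s.FirstToken = time.Now()
        emit("llm_stream_first_token_ms", s.FirstToken.Sub(s.Start).Milliseconds(), s.TraceID)
    }
    s.TokensOut++
}

func (s *streamStats) onDone(err error) {
    emit("llm_stream_total_duration_ms", time.Since(s.Start).Milliseconds(), s.TraceID)
    emit("llm_stream_tokens_out", int64(s.TokensOut), s.TraceID)
    emit("llm_stream_retry_count", int64(s.Retries), s.TraceID)
    if err != nil {
        emit("llm_stream_error_total", 1, s.TraceID) // label with code and type in your backend
    }
}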

5) Retry-capable stream skeleton

// StreamWithRetry wraps streamOnce with bounded retries and backoff between
// retryable failures. The shared ctx still enforces the overall 45s deadline.
func StreamWithRetry(ctx context.Context, req *Request) error {
    const maxAttempts = 4
    var lastErr error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        start := time.Now()
        err := streamOnce(ctx, req)
        recordMetrics(time.Since(start), attempt, err)

        if err == nil {
            return nil
        }
        if !isRetryable(err) {
            return err
        }
        lastErr = err

        if attempt == maxAttempts-1 {
            break // last attempt failed; don't sleep just to give up
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(backoff(attempt)):
        }
    }
    return fmt.Errorf("stream failed after retries: %w", lastErr)
}
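
The skeleton leaves streamOnce abstract. Its shape, reusing the idle-timeout consumer and streamStats from earlier sections; openStream, writeDelta, Event, and the req.TraceID field are all placeholders, not a specific client API:

func streamOnce(ctx context.Context, req *Request) error {
    stats := &streamStats{TraceID: req.TraceID, Start: time.Now()}

    // openStream stands in for the Responses API streaming call in your client
    // library, assumed to return a channel that closes when the response is done.
    events, err := openStream(ctx, req)
    if err != nil {
        stats.onDone(err)
        return err // setup failure; isRetryable decides whether another attempt makes sense
    }

    err = consumeWithIdleTimeout(ctx, events, 15*time.Second, func(ev Event) {
        stats.onToken() // records first-token latency on the first delta
        writeDelta(ev)  // placeholder: forward the delta to the end user
    })
    stats.onDone(err)
    return err
}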

6) Common failure patterns

  • Not canceling context when client disconnects (goroutine leaks)
  • Buffering the whole answer before returning (not really streaming)
  • Tracking failures only, not first-token latency
  • Retrying without error-type filtering
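
The first two patterns are both handler-boundary issues: derive the stream context from r.Context() so a client disconnect cancels everything downstream, and flush every delta as it arrives. A sketch (the consume loop is inlined here for brevity; openStream, buildRequest, and ev.Text are placeholders):

func handleStream(w http.ResponseWriter, r *http.Request) {
    // A client disconnect cancels r.Context(), which stops the upstream call
    // and any goroutines derived from it instead of leaking them.
    ctx, cancel := context.WithTimeout(r.Context(), 45*time.Second)
    defer cancel()

    flusher, ok := w.(http.Flusher)
    if !ok {
        http.Error(w, "streaming unsupported", http.StatusInternalServerError)
        return
    }
    w.Header().Set("Content-Type", "text/event-stream")

    events, err := openStream(ctx, buildRequest(r))
    if err != nil {
        http.Error(w, "upstream error", http.StatusBadGateway)
        return
    }
    for ev := range events {
        fmt.Fprintf(w, "data: %s\n\n", ev.Text)
        flusher.Flush() // push each delta immediately; never buffer the whole answer
    }
}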

7) Pre-launch validation checklist

  • Inject 429 and verify backoff + retries
  • Inject 5xx and verify final error observability
  • Simulate short network drop and verify recovery
  • Run a concurrency test (e.g. 50 and 100 parallel streams) and watch p95 first-token latency
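
For the first item, a fake upstream built with net/http/httptest keeps the check self-contained. Plain GETs stand in for the real streaming client here; in a full suite you would point the client's base URL at srv.URL instead:

func TestBackoffOn429(t *testing.T) {
    var attempts int32
    srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // First two attempts get rate-limited, the third succeeds.
        if atomic.AddInt32(&attempts, 1) <= 2 {
            w.WriteHeader(http.StatusTooManyRequests)
            return
        }
        w.WriteHeader(http.StatusOK)
    }))
    defer srv.Close()

    for attempt := 0; attempt < 4; attempt++ {
        resp, err := http.Get(srv.URL)
        if err != nil {
            t.Fatal(err)
        }
        resp.Body.Close()
        if resp.StatusCode == http.StatusOK {
            if got := atomic.LoadInt32(&attempts); got != 3 {
                t.Fatalf("expected 3 attempts, got %d", got)
            }
            return
        }
        time.Sleep(backoff(attempt))
    }
    t.Fatal("never recovered from injected 429s")
}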

Summary

Reliable Responses API streaming in Go is mostly engineering hygiene:

  1. Error boundaries
  2. Time boundaries
  3. Observability boundaries

Build these three first, then optimize prompt and model quality.