Most Go teams are not killed by a single API error. They are killed by a retry storm they created themselves.

TL;DR

When OpenAI starts returning 429 and 5xx under pressure, the stable pattern is:

  1. Rate-limit at ingress with a token bucket.
  2. Retry only recoverable failures with exponential backoff + jitter.
  3. Trip a circuit breaker on sustained failure windows.
  4. Use request budgets + idempotency keys to prevent runaway retries and duplicate side effects.
  5. Track the right metrics (429 ratio, retry depth, breaker state).

Why simple retries fail

A common production loop looks like this:

  • traffic spike raises concurrency
  • upstream rate limit returns 429
  • app retries immediately (without jitter)
  • retries multiply load
  • 5xx increases
  • even more retries

That is a positive feedback loop, not resilience.

1) Token bucket first: control concurrency before anything else

import "golang.org/x/time/rate"

var limiter = rate.NewLimiter(rate.Limit(8), 16) // 8 rps, burst 16

func allow(ctx context.Context) error {
    return limiter.Wait(ctx)
}

Practical baseline:

  • online traffic: per-tenant + global limit (see the sketch below)
  • batch jobs: isolated queue and quota (never steal online quota)
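
A per-tenant limiter layered under the global one can be as small as a mutex-guarded map. A minimal sketch; tenantLimiter, allowTenant, and the 2 rps per-tenant rate are illustrative choices, not fixed recommendations:

var (
    tenantMu sync.Mutex
    tenants  = map[string]*rate.Limiter{}
)

// tenantLimiter lazily creates one limiter per tenant.
func tenantLimiter(id string) *rate.Limiter {
    tenantMu.Lock()
    defer tenantMu.Unlock()
    l, ok := tenants[id]
    if !ok {
        l = rate.NewLimiter(rate.Limit(2), 4) // 2 rps per tenant, burst 4
        tenants[id] = l
    }
    return l
}

// allowTenant gates on the tenant's quota first, then on the global bucket.
func allowTenant(ctx context.Context, tenantID string) error {
    if err := tenantLimiter(tenantID).Wait(ctx); err != nil {
        return err
    }
    return limiter.Wait(ctx)
}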

2) Backoff with jitter: retry only what is retryable

func backoff(attempt int) time.Duration {
    base := 200 * time.Millisecond
    maxDelay := 5 * time.Second
    d := base * time.Duration(1<<attempt) // 200ms, 400ms, 800ms, ...
    if d > maxDelay { d = maxDelay }
    // Equal jitter: sleep between d/2 and d so synchronized clients spread out.
    jitter := time.Duration(rand.Int63n(int64(d / 2)))
    return d/2 + jitter
}

func retryable(status int) bool {
    if status == 429 { return true }                  // rate limited
    if status >= 500 && status <= 599 { return true } // upstream failure
    return false
}

Rules that save real systems:

  • don’t retry most 4xx
  • respect Retry-After when present (sketch below)
  • cap retries to 2–3 attempts per request
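
Respecting Retry-After mostly means preferring the server's hint to your computed backoff. A minimal sketch; retryDelay is a name invented here, and it falls back to the backoff function above:

// retryDelay prefers the server's Retry-After hint (seconds or an HTTP date)
// and falls back to exponential backoff with jitter.
func retryDelay(resp *http.Response, attempt int) time.Duration {
    if resp != nil {
        if v := resp.Header.Get("Retry-After"); v != "" {
            if secs, err := strconv.Atoi(v); err == nil && secs >= 0 {
                return time.Duration(secs) * time.Second
            }
            if t, err := http.ParseTime(v); err == nil {
                if d := time.Until(t); d > 0 {
                    return d
                }
            }
        }
    }
    return backoff(attempt)
}

Wherever you compute the sleep between attempts, call this instead of backoff directly.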

3) Circuit breaker: fail fast when failure ratio is sustained

Using sony/gobreaker:

var cb = gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name:     "openai-responses",
    Interval: 30 * time.Second, // counts reset every 30s while closed
    Timeout:  20 * time.Second, // stay open for 20s, then probe via half-open
    ReadyToTrip: func(c gobreaker.Counts) bool {
        if c.Requests < 20 { return false } // require a minimum sample size
        return float64(c.TotalFailures)/float64(c.Requests) >= 0.5
    },
})

Open state protects both upstream and your own workers from meltdown.
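
Breaker state is also a metric (item 5 in the TL;DR). One way to surface it is gobreaker's OnStateChange hook, added to the Settings literal above; the log line is a stand-in for whatever gauge or event your metrics stack expects:

OnStateChange: func(name string, from, to gobreaker.State) {
    // closed -> open, open -> half-open, half-open -> closed all land here
    log.Printf("circuit breaker %q: %s -> %s", name, from, to)
},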

4) Request budget: retries must fit your SLA, not exceed it

For an 8-second end-to-end budget:

  • first attempt: 3.5s
  • two retries: 1.5s each
  • remaining ≈1.5s: backoff sleeps, network tail, and serialization

Budget is a hard boundary, not a suggestion.
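
One way to enforce it, assuming the caller puts the whole 8-second budget on ctx with context.WithTimeout, is to clamp every per-attempt timeout to whatever remains; attemptTimeout is an illustrative helper, not part of any library:

// attemptTimeout caps a desired per-attempt timeout at the time left in the
// overall budget carried by ctx.
func attemptTimeout(ctx context.Context, want time.Duration) time.Duration {
    if deadline, ok := ctx.Deadline(); ok {
        if remaining := time.Until(deadline); remaining < want {
            return remaining
        }
    }
    return want
}

With the caller doing a single context.WithTimeout(ctx, 8*time.Second) at the edge, a late retry can never schedule more work than the budget has left.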

5) Idempotency keys: prevent duplicate cost and duplicate writes

Build key from business identity + payload fingerprint:

  • idempotency_key = sha256(user_id + task_id + payload_hash)
  • short TTL result cache
  • return cached result on replay
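
A minimal sketch of the key derivation; the helper name and the separator byte are illustrative:

// idempotencyKey fingerprints business identity plus the request payload.
func idempotencyKey(userID, taskID string, payload []byte) string {
    h := sha256.New()
    h.Write([]byte(userID))
    h.Write([]byte{0}) // separator so "ab"+"c" and "a"+"bc" hash differently
    h.Write([]byte(taskID))
    h.Write([]byte{0})
    h.Write(payload)
    return hex.EncodeToString(h.Sum(nil))
}

On a cache hit for that key (a short-TTL store such as Redis with SET ... EX works fine), return the stored response without calling upstream; on a miss, call upstream and write the result before returning.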

Production incident checklist

  1. check 15-minute metrics: 429_rate, 5xx_rate, retry_attempt_avg
  2. if 429_rate > 5%: reduce token bucket rate by 20%
  3. if 5xx_rate > 10%: open breaker and pause non-critical traffic
  4. verify jitter is enabled
  5. verify Retry-After handling
  6. verify batch traffic cannot consume online capacity
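
Those numbers have to come from somewhere. A sketch of the underlying instruments, assuming Prometheus and illustrative metric names; 429_rate and 5xx_rate are then just ratios over openai_responses_total in your dashboards:

var (
    // responses by status class ("200", "429", "5xx") drive 429_rate and 5xx_rate
    responsesTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "openai_responses_total", Help: "Upstream responses by status."},
        []string{"status"},
    )
    // attempts used per logical request drives retry_attempt_avg
    retryAttempts = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "openai_retry_attempts",
        Help:    "Attempts consumed per request.",
        Buckets: []float64{1, 2, 3},
    })
)

func init() {
    prometheus.MustRegister(responsesTotal, retryAttempts)
}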

Minimal middleware skeleton

var client = &http.Client{} // shared client so connections are reused across calls

// cancelOnClose keeps the per-attempt context alive until the caller closes the body.
type cancelOnClose struct {
    io.ReadCloser
    cancel context.CancelFunc
}

func (c *cancelOnClose) Close() error {
    c.cancel()
    return c.ReadCloser.Close()
}

func CallOpenAI(ctx context.Context, req *http.Request) (*http.Response, error) {
    if err := allow(ctx); err != nil { return nil, err }

    var lastErr error
    for attempt := 0; attempt <= 2; attempt++ {
        cctx, cancel := context.WithTimeout(ctx, 3500*time.Millisecond)
        resp, err := cb.Execute(func() (interface{}, error) {
            attemptReq := req.Clone(cctx)
            if req.GetBody != nil {
                attemptReq.Body, _ = req.GetBody() // replay the body on each attempt
            }
            r, err := client.Do(attemptReq)
            if err != nil { return nil, err }
            if retryable(r.StatusCode) {
                r.Body.Close() // count 429/5xx as breaker failures, then retry
                return nil, fmt.Errorf("retryable status=%d", r.StatusCode)
            }
            return r, nil
        })
        if err == nil {
            r := resp.(*http.Response)
            r.Body = &cancelOnClose{ReadCloser: r.Body, cancel: cancel} // cancel on body close
            return r, nil
        }
        cancel()
        lastErr = err

        if attempt == 2 { break }
        time.Sleep(backoff(attempt))
    }
    return nil, lastErr
}
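
Three details carry the weight here: the request is cloned (and its body replayed via GetBody) on every attempt, 429/5xx responses are turned into errors inside Execute so the breaker actually sees them, and the per-attempt context stays alive until the caller closes the body instead of being cancelled mid-read.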

Final takeaway

If you only “retry harder,” you amplify failures. If you control flow, budget retries, and break circuits on bad windows, your Go + OpenAI stack behaves like a production system instead of a lucky demo.