The most expensive outage is not a single failure — it is a failure amplified by retries.

In an OpenAI Responses + Go tool-calling stack, missing idempotency keys, unjittered backoff, and absent breaker thresholds can turn 10 failing requests into 1000 downstream calls in minutes.

TL;DR: you need all three guardrails:

  1. Idempotency key: one business action should apply once.
  2. Backoff + jitter: retries must spread out, not synchronize.
  3. Circuit breaker threshold: fail fast when error budget is blown.

How retry storms usually start

Common bad setup:

  • HTTP timeout too short (for example, 3 seconds)
  • Gateway retries 3 times + service retries 3 times
  • No idempotency control in tool execution
  • Fixed retry interval on all instances (no jitter)

What happens next:

  • A tiny upstream hiccup is amplified 9x to 27x (3 gateway retries × 3 service retries is already 9x; a third retrying layer pushes it toward 27x)
  • P95 latency spikes and queues pile up
  • Alerts fan out across API errors, DB lock contention, and cache misses

Go implementation: idempotency keys

Recommended key format:

idem:{tenant}:{workflow}:{biz_id}:{step}

Rules:

  • Build from business-unique fields (not random UUIDs)
  • TTL must cover your max retry window (for example, 15 minutes)
  • Store status, response hash, first-seen and last-updated timestamps
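
A minimal key builder, as a sketch; it assumes the four fields are available at the call site, and the function name is illustrative:

// Deterministic key built from business-unique fields: the same action
// always yields the same key, so duplicates collide on purpose.
func buildIdemKey(tenant, workflow, bizID, step string) string {
    return fmt.Sprintf("idem:%s:%s:%s:%s", tenant, workflow, bizID, step)
}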

Redis example (SETNX + TTL):

// rdb: a context-aware go-redis client (e.g. github.com/redis/go-redis/v9).
// SetNX writes only if the key is absent, so the first caller wins the slot.
ok, err := rdb.SetNX(ctx, idemKey, "PENDING", 15*time.Minute).Result()
if err != nil {
    return err
}
if !ok {
    // Existing execution found: return cached outcome
    return ErrDuplicateSuppressed
}

Write a result summary after success:

// Overwrite PENDING with the final status plus a hash of the tool result.
_ = rdb.Set(ctx, idemKey, "DONE:tool_result_hash", 15*time.Minute).Err()
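
To store the richer record the rules call for (status, response hash, timestamps), a Redis hash works better than a flat string. A sketch, assuming respHash and firstSeen are computed by the caller; field names are illustrative:

// HSet writes the structured record; Expire keeps it inside the TTL window.
_ = rdb.HSet(ctx, idemKey,
    "status", "DONE",
    "resp_hash", respHash,
    "first_seen", firstSeen.Unix(),
    "updated_at", time.Now().Unix(),
).Err()
_ = rdb.Expire(ctx, idemKey, 15*time.Minute).Err()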

Go implementation: exponential backoff with full jitter

Wrong: fixed sleep(500ms).

Right: exponential backoff + full jitter:

func backoff(attempt int, base, cap time.Duration) time.Duration {
    d := base << attempt // base * 2^attempt
    if d <= 0 || d > cap {
        d = cap // clamp: respects the cap and guards against shift overflow
    }
    return time.Duration(rand.Int63n(int64(d))) // full jitter: uniform in [0, d)
}

Conservative defaults:

  • base = 200ms
  • cap = 5s
  • maxAttempts = 4
  • Retry only retryable classes (429/5xx/transient network errors), as in the loop sketch below
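
Putting these together, a minimal retry loop as a sketch; doRequest stands in for whatever issues the HTTP call, and the status classification is deliberately simplified:

func callWithRetry(ctx context.Context, doRequest func(context.Context) (*http.Response, error)) (*http.Response, error) {
    const maxAttempts = 4
    base, limit := 200*time.Millisecond, 5*time.Second
    var lastErr error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        resp, err := doRequest(ctx)
        switch {
        case err == nil && resp.StatusCode != 429 && resp.StatusCode < 500:
            return resp, nil // success, or a non-retryable 4xx
        case err == nil:
            resp.Body.Close() // retryable status: 429 or 5xx
            lastErr = fmt.Errorf("retryable status %d", resp.StatusCode)
        default:
            lastErr = err // treat as transient; classify more carefully in production
        }
        if attempt == maxAttempts-1 {
            break // no sleep after the final attempt
        }
        select {
        case <-ctx.Done():
            return nil, ctx.Err()
        case <-time.After(backoff(attempt, base, limit)): // full jitter from above
        }
    }
    return nil, lastErr
}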

Go implementation: breaker thresholds with error budget

Use a 30-second sliding window:

  • requests >= 50
  • error rate >= 25%
  • trigger in 2 consecutive windows → open for 20 seconds

Pseudocode:

if window.Req >= 50 && window.ErrRate() >= 0.25 {
    breaker.Trip(20 * time.Second)
}
if breaker.Open() {
    return ErrFastFail
}
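
One way to make that concrete, as a sketch: a tumbling 30-second window approximating the sliding window, with illustrative type and method names:

type Breaker struct {
    mu        sync.Mutex
    req, errs int       // counters for the current 30s window
    badWins   int       // consecutive windows over the error budget
    winStart  time.Time
    openUntil time.Time
}

// Allow reports whether calls may proceed (the breaker is not open).
func (b *Breaker) Allow() bool {
    b.mu.Lock()
    defer b.mu.Unlock()
    return time.Now().After(b.openUntil)
}

// Record counts one call; after 2 consecutive bad windows it opens for 20s.
func (b *Breaker) Record(failed bool) {
    b.mu.Lock()
    defer b.mu.Unlock()
    now := time.Now()
    if now.Sub(b.winStart) >= 30*time.Second { // window rolled over
        if b.req >= 50 && float64(b.errs) >= 0.25*float64(b.req) {
            b.badWins++
        } else {
            b.badWins = 0
        }
        if b.badWins >= 2 {
            b.openUntil = now.Add(20 * time.Second)
            b.badWins = 0
        }
        b.req, b.errs, b.winStart = 0, 0, now
    }
    b.req++
    if failed {
        b.errs++
    }
}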

Fallback policy when open:

  • Return cached summary or last known good result
  • Skip non-critical tools
  • Tell users output may be partial
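
Wired into the tool-execution path, the first option might look like this; cachedResult is a hypothetical lookup against the idempotency store:

if !breaker.Allow() {
    if res, ok := cachedResult(ctx, idemKey); ok {
        return res, nil // last known good result; flag it as possibly stale
    }
    return nil, ErrFastFail // caller can skip this non-critical tool
}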

Metrics you must ship

At minimum:

  • tool_call_total{tool,status}
  • retry_total{reason}
  • idempotency_suppressed_total
  • breaker_open_total
  • llm_latency_ms_p95
  • cost_usd_total
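
Assuming Prometheus client_golang (any metrics backend works), the first two map onto counter vectors like this:

var (
    toolCallTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "tool_call_total", Help: "Tool calls by tool and status."},
        []string{"tool", "status"},
    )
    retryTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "retry_total", Help: "Retries by reason."},
        []string{"reason"},
    )
)

// Usage: toolCallTotal.WithLabelValues("web_search", "ok").Inc()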

Alert ideas:

  • retry_total > 3x baseline in 5 minutes
  • sudden jump in idempotency_suppressed_total
  • sustained breaker_open_total > 0

Troubleshooting checklist

  1. Check 429/5xx ratio in the last 15 minutes.
  2. Confirm you do not have double retry layers.
  3. Sample failing requests and verify key stability.
  4. Verify retries are jittered, not fixed sleep.
  5. Check breaker open/half-open recovery behavior.
  6. Reconcile duplicate writes or duplicate charges.

Summary

Retries are not free.

In production Responses + Go pipelines, the practical order is idempotency first, jittered retries second, circuit breaker third; it turns a potential avalanche into controlled degradation.

If you can do only one thing today: add idempotency keys first. It usually delivers the highest ROI immediately.