When the Claude API starts returning 429s under high load, most systems don’t merely slow down. They collapse: queue buildup, retry storms, upstream timeout chains, and pager noise.

This guide gives you a production-ready approach: adaptive concurrency + exponential backoff with full jitter + quota isolation. The goal is not zero 429s, but keeping throttling inside a recoverable zone while preserving your SLOs.

Set Targets First: Optimize for Recovery, Not Zero Errors

Use three guardrail metrics:

  • 429 ratio: < 2% (5-minute window)
  • P95 end-to-end latency: < 8s (including queue wait)
  • Retry amplification: < 1.3x (average attempts per original request)

If these hold, users usually don’t feel the incident.
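
Two of the three can be computed from plain per-window counters; P95 latency needs a histogram. A minimal Go sketch, where WindowStats and its fields are illustrative rather than any real metrics API:

// Deriving guardrails #1 and #3 from per-window counters.
type WindowStats struct {
    Requests     int64 // original requests in the window
    Attempts     int64 // total attempts, including retries
    Responses429 int64 // attempts answered with 429
}

// Ratio429 should stay below 0.02.
func (w WindowStats) Ratio429() float64 {
    if w.Attempts == 0 {
        return 0
    }
    return float64(w.Responses429) / float64(w.Attempts)
}

// RetryAmplification should stay below 1.3.
func (w WindowStats) RetryAmplification() float64 {
    if w.Requests == 0 {
        return 1
    }
    return float64(w.Attempts) / float64(w.Requests)
}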

1) Replace Fixed Concurrency with Adaptive Concurrency

Why fixed limits fail

A hard-coded cap (for example, always 64) becomes dangerous the moment traffic spikes or the provider’s quota changes: the limit that was safe yesterday is suddenly too high, and the excess turns straight into 429s.

AIMD pattern

  • Increase slowly on healthy windows (Additive Increase)
  • Cut fast on 429/5xx bursts (Multiplicative Decrease)

import "sync/atomic"

type Gate struct {
    max int64 // current concurrency cap, read/written atomically
}

// OnHealthyWindow: additive increase. +1 per healthy window, capped at 128.
func (g *Gate) OnHealthyWindow() {
    cur := atomic.LoadInt64(&g.max)
    atomic.StoreInt64(&g.max, min(cur+1, 128)) // min is a builtin since Go 1.21
}

// OnThrottle: multiplicative decrease. Cut to 70%, floored at 4.
func (g *Gate) OnThrottle() {
    cur := atomic.LoadInt64(&g.max)
    next := int64(float64(cur) * 0.7)
    if next < 4 {
        next = 4
    }
    atomic.StoreInt64(&g.max, next)
}
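
The Gate above only stores the cap; each request still has to be admitted against it. One possible wiring, a sketch that assumes the struct also gains an inflight int64 field:

// TryAcquire admits a request if inflight < max, using a CAS loop
// so concurrent callers can't overshoot the cap.
func (g *Gate) TryAcquire() bool {
    for {
        cur := atomic.LoadInt64(&g.inflight)
        if cur >= atomic.LoadInt64(&g.max) {
            return false // at capacity: queue or shed this request
        }
        if atomic.CompareAndSwapInt64(&g.inflight, cur, cur+1) {
            return true
        }
    }
}

// Release frees a slot once the request finishes.
func (g *Gate) Release() {
    atomic.AddInt64(&g.inflight, -1)
}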

Production defaults:

  • Upper bound: no more than 1.2x proven stable throughput
  • Lower bound: keep at least 4-8
  • Allow increase only every 30-60s to avoid oscillation
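
The last rule can live in code too. A sketch that assumes Gate additionally carries a lastIncrease int64 timestamp (unix nanos) and imports "time":

// maybeIncrease applies additive increase at most once per 30s.
func (g *Gate) maybeIncrease() {
    now := time.Now().UnixNano()
    last := atomic.LoadInt64(&g.lastIncrease)
    if now-last < int64(30*time.Second) {
        return // too soon: hold the cap steady to avoid oscillation
    }
    if atomic.CompareAndSwapInt64(&g.lastIncrease, last, now) {
        g.OnHealthyWindow()
    }
}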

2) Smart Retry Only: Exponential Backoff + Full Jitter

Immediate retry after 429 is the #1 amplifier of incidents.

Safer policy

  • Initial retry base: 200-400ms
  • Backoff: base * 2^attempt
  • Jitter: rand(0, backoff) (Full Jitter)
  • Max retries: 2-3 for interactive traffic

// backoff returns a full-jitter delay: rand(0, min(base*2^attempt, cap)).
// Requires "math/rand" and "time".
func backoff(attempt, baseMs, capMs int) time.Duration {
    ceiling := baseMs * (1 << attempt)
    if ceiling > capMs {
        ceiling = capMs
    }
    return time.Duration(rand.Intn(ceiling+1)) * time.Millisecond
}

Tie the retry budget to the timeout budget: if the request budget is 8 seconds, don’t schedule a third retry at second 9.
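
Putting both rules together: a sketch of a retry loop that stops at the attempt cap and refuses to schedule a retry the deadline can’t cover. callClaude and isRetryable are hypothetical stand-ins for your client code:

import (
    "context"
    "time"
)

func callWithRetry(ctx context.Context, maxAttempts int) error {
    var err error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        if err = callClaude(ctx); err == nil || !isRetryable(err) {
            return err // success, or an error retries can't fix
        }
        if attempt == maxAttempts-1 {
            break // attempt cap reached
        }
        delay := backoff(attempt, 250, 2500)
        if dl, ok := ctx.Deadline(); ok && time.Now().Add(delay).After(dl) {
            break // the wait would outlive the budget: give up now
        }
        select {
        case <-time.After(delay):
        case <-ctx.Done():
            return ctx.Err()
        }
    }
    return err
}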

3) Quota Isolation: One Tenant Must Not Sink Everyone

For multi-tenant gateways, shared global pools are risky.

Minimal viable isolation

  • Global pool to protect provider quota
  • Per-tenant pool to cap bursts
  • Priority lanes: interactive > batch

Example split:

  • Global concurrency: 60
  • Guaranteed premium capacity: 20
  • Shared regular capacity: 40
  • Batch cap: 15 (preemptible)

Now batch spikes won’t drown critical online traffic.
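
A minimal sketch of that split using buffered channels as counting semaphores. Pool sizes mirror the example above; the lane names are illustrative:

import "context"

var (
    global = make(chan struct{}, 60) // protects provider quota
    lanes  = map[string]chan struct{}{
        "premium": make(chan struct{}, 20),
        "regular": make(chan struct{}, 40),
        "batch":   make(chan struct{}, 15),
    }
)

// acquire takes a lane slot first (local fairness), then a global slot.
// The returned func releases both.
func acquire(ctx context.Context, lane string) (release func(), err error) {
    l := lanes[lane]
    select {
    case l <- struct{}{}:
    case <-ctx.Done():
        return nil, ctx.Err()
    }
    select {
    case global <- struct{}{}:
    case <-ctx.Done():
        <-l // roll back the lane slot
        return nil, ctx.Err()
    }
    return func() { <-global; <-l }, nil
}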

4) Circuit Breaker + Half-Open Recovery

When the 429/5xx error rate crosses a threshold, short-circuit briefly:

  • Open if 30s error rate > 25%
  • Open for 10-20s
  • Half-open probes at 5-10% of traffic
  • Gradual ramp-up only after probe success

Pair this with graceful degradation (cached summary, delayed response notice) instead of blind timeouts.
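
A compact sketch of the open/half-open state machine with the thresholds above hard-coded; gradual ramp-up after probe success is left out for brevity. Uses "math/rand", "sync", and "time":

type Breaker struct {
    mu       sync.Mutex
    open     bool
    openedAt time.Time
}

// Allow reports whether a request may proceed right now.
func (b *Breaker) Allow() bool {
    b.mu.Lock()
    defer b.mu.Unlock()
    if !b.open {
        return true
    }
    if time.Since(b.openedAt) < 15*time.Second {
        return false // fully open: shed immediately
    }
    return rand.Float64() < 0.10 // half-open: ~10% probe traffic
}

// OnWindow feeds in the error rate of the last 30s window.
func (b *Breaker) OnWindow(errorRate float64) {
    b.mu.Lock()
    defer b.mu.Unlock()
    if errorRate > 0.25 {
        b.open, b.openedAt = true, time.Now()
    } else {
        b.open = false
    }
}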

5) Observability: Without These Graphs, You’re Flying Blind

Track at least:

  • requests_total{model,status}
  • retry_attempts_histogram
  • queue_wait_ms_p95
  • inflight_by_tenant
  • circuit_breaker_state
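
For the first two, the definitions with Prometheus’s Go client could look like this (assumes github.com/prometheus/client_golang; metric names copied from the list):

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// Incremented as requestsTotal.WithLabelValues(model, status).Inc()
// next to every provider response.
var requestsTotal = promauto.NewCounterVec(
    prometheus.CounterOpts{Name: "requests_total"},
    []string{"model", "status"},
)

// Attempts per original request; the mean should stay under 1.3.
var retryAttempts = promauto.NewHistogram(prometheus.HistogramOpts{
    Name:    "retry_attempts_histogram",
    Buckets: []float64{1, 2, 3, 4},
})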

Debug order:

  1. Check if 429 is concentrated in one model/tenant
  2. Verify queue wait inflation trend
  3. Confirm retry amplification is under control

6) Drop-In Emergency Config

Use this to stop the bleeding first:

llm_gateway:
  timeout_ms: 8000
  max_inflight_global: 48
  max_inflight_per_tenant: 12
  retry:
    max_attempts: 3
    base_backoff_ms: 250
    cap_backoff_ms: 2500
    jitter: full
  circuit_breaker:
    error_rate_threshold: 0.25
    open_seconds: 15
    half_open_probe_ratio: 0.1

Stabilize first, then tune against real traffic profiles.

Common Mistakes

  • Mistake 1: Treating 429 as random network noise → infinite retry loop
  • Mistake 2: Flat priority for all requests → critical traffic starved by batch jobs
  • Mistake 3: Global limit only, no tenant isolation → one noisy tenant hurts all others

Summary

A 429 from the Claude API is normal. Uncontrolled retry dynamics are not.

Three controls reduce incident risk dramatically:

  1. Adaptive concurrency
  2. Exponential backoff with full jitter
  3. Quota isolation plus circuit-breaker recovery

Build for graceful degradation, fast recovery, and observability. Stable systems monetize better than spiky systems.