When the Claude API starts returning 429s under high load, most systems don't just slow down; they collapse: queue buildup, retry storms, upstream timeout chains, and pager noise.
This guide gives you a production-ready approach: adaptive concurrency, exponential backoff with full jitter, and quota isolation. The goal is not "zero 429s" but keeping throttling inside a recoverable zone while preserving your SLOs.
Set Targets First: Optimize for Recovery, Not Zero Errors
Use three guardrail metrics:
- 429 ratio: < 2% (5-minute window)
- P95 end-to-end latency: < 8s (including queue wait)
- Retry amplification: < 1.3x (average attempts per original request)
If these hold, users usually don’t feel the incident.
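The three guardrails above are cheap to evaluate from raw counters. A minimal sketch, assuming you already track total requests, throttled responses, and total attempts per window (the function name and signature are illustrative, not from any metrics library):

```go
package main

import "fmt"

// guardrails checks the three targets from windowed counters:
// 429 ratio < 2%, P95 latency < 8s, retry amplification < 1.3x.
func guardrails(total, throttled, attempts int, p95LatencyMs float64) bool {
	ratio429 := float64(throttled) / float64(total)
	amplification := float64(attempts) / float64(total)
	return ratio429 < 0.02 && p95LatencyMs < 8000 && amplification < 1.3
}

func main() {
	// 10,000 requests, 150 throttled, 11,000 total attempts, P95 = 6.2s
	fmt.Println(guardrails(10000, 150, 11000, 6200)) // true: healthy window
}
```

Alert when any one of the three flips, not only when all three do; they degrade at different speeds.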
1) Replace Fixed Concurrency with Adaptive Concurrency
Why fixed limits fail
A hard-coded cap (for example, always 64) becomes dangerous when traffic spikes or provider quota changes suddenly.
AIMD pattern
- Increase slowly on healthy windows (Additive Increase)
- Cut fast on 429/5xx bursts (Multiplicative Decrease)
type Gate struct {
	max int64 // adaptive concurrency ceiling; accessed atomically
}

// OnHealthyWindow: additive increase, capped at a proven-safe upper bound.
func (g *Gate) OnHealthyWindow() {
	cur := atomic.LoadInt64(&g.max)
	atomic.StoreInt64(&g.max, min(cur+1, 128)) // min builtin needs Go 1.21+
}

// OnThrottle: multiplicative decrease on 429/5xx bursts, with a floor.
func (g *Gate) OnThrottle() {
	cur := atomic.LoadInt64(&g.max)
	next := int64(float64(cur) * 0.7)
	if next < 4 {
		next = 4
	}
	atomic.StoreInt64(&g.max, next)
}
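The AIMD logic only moves the ceiling; something still has to enforce it per request. A minimal sketch pairing the ceiling with an in-flight counter (the `TryAcquire`/`Release` names are assumptions, not from the article):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Gate pairs an adaptive ceiling with an in-flight counter that enforces it.
type Gate struct {
	max      int64 // adaptive ceiling, adjusted by the AIMD callbacks
	inflight int64 // requests currently in flight
}

// TryAcquire admits a request only while in-flight count is below the ceiling.
func (g *Gate) TryAcquire() bool {
	if atomic.AddInt64(&g.inflight, 1) > atomic.LoadInt64(&g.max) {
		atomic.AddInt64(&g.inflight, -1) // over the limit: roll back
		return false
	}
	return true
}

// Release must be called exactly once per successful TryAcquire.
func (g *Gate) Release() { atomic.AddInt64(&g.inflight, -1) }

func main() {
	g := &Gate{max: 2}
	fmt.Println(g.TryAcquire(), g.TryAcquire(), g.TryAcquire()) // true true false
	g.Release()
	fmt.Println(g.TryAcquire()) // true: a slot freed up
}
```

Rejected requests should queue briefly or shed load, not spin on `TryAcquire`.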
Production defaults:
- Upper bound: no more than 1.2x proven stable throughput
- Lower bound: keep at least 4–8
- Allow increases only every 30–60s to avoid oscillation
2) Smart Retry Only: Exponential Backoff + Full Jitter
Immediate retry after 429 is the #1 amplifier of incidents.
Safer policy
- Initial retry base: 200–400ms
- Backoff: base * 2^attempt
- Jitter: rand(0, backoff) (Full Jitter)
- Max retries: 2–3 for interactive traffic
// backoff returns a full-jitter delay: uniform in [0, min(base*2^attempt, cap)].
func backoff(attempt, baseMs, capMs int) time.Duration {
	d := baseMs * (1 << attempt) // exponential growth per attempt
	if d > capMs {
		d = capMs // never exceed the cap
	}
	return time.Duration(rand.Intn(d+1)) * time.Millisecond // full jitter
}
Tie retry budget to timeout budget. If request budget is 8 seconds, don’t schedule a third retry at second 9.
3) Quota Isolation: One Tenant Must Not Sink Everyone
For multi-tenant gateways, shared global pools are risky.
Minimal viable isolation
- Global pool to protect provider quota
- Per-tenant pool to cap bursts
- Priority lanes: interactive > batch
Example split:
- Global concurrency: 60
- Guaranteed premium capacity: 20
- Shared regular capacity: 40
- Batch cap: 15 (preemptible)
Now batch spikes won’t drown critical online traffic.
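The two-level pool can be sketched with buffered channels as semaphores: a request must win a tenant slot and a global slot, and a full tenant pool rejects it before it touches shared capacity. Type and method names here are illustrative, not a production pool implementation:

```go
package main

import "fmt"

// Pools: global semaphore protects the provider quota,
// per-tenant semaphores cap bursts.
type Pools struct {
	global chan struct{}
	tenant map[string]chan struct{}
}

func NewPools(globalCap int, tenantCaps map[string]int) *Pools {
	p := &Pools{global: make(chan struct{}, globalCap), tenant: map[string]chan struct{}{}}
	for t, c := range tenantCaps {
		p.tenant[t] = make(chan struct{}, c)
	}
	return p
}

// TryAdmit is non-blocking: the tenant cap is checked first, so a noisy
// tenant is rejected before it can consume shared global capacity.
func (p *Pools) TryAdmit(tenant string) bool {
	tp, ok := p.tenant[tenant]
	if !ok {
		return false
	}
	select {
	case tp <- struct{}{}:
	default:
		return false // tenant burst cap reached
	}
	select {
	case p.global <- struct{}{}:
		return true
	default:
		<-tp // roll back the tenant slot
		return false // provider quota exhausted
	}
}

func (p *Pools) Release(tenant string) {
	<-p.global
	<-p.tenant[tenant]
}

func main() {
	// Caps mirror the example split above.
	p := NewPools(60, map[string]int{"premium": 20, "batch": 15})
	for i := 0; i < 15; i++ {
		p.TryAdmit("batch")
	}
	fmt.Println(p.TryAdmit("batch"))   // false: batch cap hit
	fmt.Println(p.TryAdmit("premium")) // true: premium lane unaffected
}
```

Priority lanes then reduce to admission order: drain interactive queues before batch when global slots free up.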
4) Circuit Breaker + Half-Open Recovery
When 429/5xx crosses a threshold, short-circuit briefly:
- Open if 30s error rate > 25%
- Open for 10–20s
- Half-open probes at 5–10% of traffic
- Gradual ramp-up only after probe success
Pair this with graceful degradation (cached summary, delayed response notice) instead of blind timeouts.
5) Observability: Without These Graphs, You’re Flying Blind
Track at least:
- requests_total{model,status}
- retry_attempts_histogram
- queue_wait_ms_p95
- inflight_by_tenant
- circuit_breaker_state
Debug order:
- Check if 429 is concentrated in one model/tenant
- Verify queue wait inflation trend
- Confirm retry amplification is under control
6) Drop-In Emergency Config
Use this to stop the bleeding first:
llm_gateway:
  timeout_ms: 8000
  max_inflight_global: 48
  max_inflight_per_tenant: 12
  retry:
    max_attempts: 3
    base_backoff_ms: 250
    cap_backoff_ms: 2500
    jitter: full
  circuit_breaker:
    error_rate_threshold: 0.25
    open_seconds: 15
    half_open_probe_ratio: 0.1
Stabilize first, then tune against real traffic profiles.
Common Mistakes
- Mistake 1: Treating 429 as random network noise → infinite retry loop
- Mistake 2: Flat priority for all requests → critical traffic starved by batch jobs
- Mistake 3: Global limit only, no tenant isolation → one noisy tenant hurts all others
Summary
429s from the Claude API are normal. Uncontrolled retry dynamics are not.
Three controls reduce incident risk dramatically:
- Adaptive concurrency
- Exponential backoff with full jitter
- Quota isolation plus circuit-breaker recovery
Build for graceful degradation, fast recovery, and observability. Stable systems monetize better than spiky systems.