<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Reliability on Mengboy Tech Notes</title>
    <link>https://www.mfun.ink/en/categories/reliability/</link>
    <description>Recent content in Reliability on Mengboy Tech Notes</description>
    <generator>Hugo -- 0.156.0</generator>
    <language>en</language>
    <lastBuildDate>Fri, 03 Apr 2026 01:15:05 +0000</lastBuildDate>
    <atom:link href="https://www.mfun.ink/en/categories/reliability/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Claude API Rate-Limit Storm Playbook: Adaptive Concurrency, Jittered Backoff, and Quota Isolation</title>
      <link>https://www.mfun.ink/en/2026/04/03/claude-api-rate-limit-storm-adaptive-concurrency-backoff-quota-isolation/</link>
      <pubDate>Fri, 03 Apr 2026 01:15:05 +0000</pubDate>
      <guid>https://www.mfun.ink/en/2026/04/03/claude-api-rate-limit-storm-adaptive-concurrency-backoff-quota-isolation/</guid>
      <description>&lt;p&gt;When Claude API starts returning 429 under high load, most systems don&amp;rsquo;t just slow down—they collapse: queue buildup, retry storms, upstream timeout chains, and pager noise.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Claude &#43; OpenAI Dual-Provider Gateway Failover: Health Probes, Circuit Breaking, and SLA Fallback</title>
      <link>https://www.mfun.ink/en/2026/03/30/claude-openai-dual-provider-gateway-failover-sla/</link>
      <pubDate>Mon, 30 Mar 2026 01:14:00 +0000</pubDate>
      <guid>https://www.mfun.ink/en/2026/03/30/claude-openai-dual-provider-gateway-failover-sla/</guid>
      <description>&lt;p&gt;If your production stack calls both Claude and OpenAI, the hard part is not API integration. The hard part is keeping user experience stable when one provider starts throwing 429/5xx spikes, regional latency, or timeout storms.&lt;/p&gt;
&lt;p&gt;This guide gives you a practical dual-provider gateway playbook: health probes, circuit breaking, SLA-aware fallback, and observability loops. The goal is not “never fail.” The goal is &lt;strong&gt;controlled failure with controlled cost and controlled latency&lt;/strong&gt;.&lt;/p&gt;</description>
    </item>
    <item>
      <title>OpenAI Responses Streaming in Production: Backpressure, Chunk Reassembly, and Timeout Budget</title>
      <link>https://www.mfun.ink/en/2026/03/27/openai-responses-streaming-backpressure-chunk-reassembly-timeout-budget/</link>
      <pubDate>Fri, 27 Mar 2026 01:08:00 +0000</pubDate>
      <guid>https://www.mfun.ink/en/2026/03/27/openai-responses-streaming-backpressure-chunk-reassembly-timeout-budget/</guid>
      <description>&lt;p&gt;Most streaming failures are not about “can it stream”, but “does it stay stable under load”: broken chunks, stuck clients, timeout cascades, and retry storms.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Handling OpenAI 429/5xx Storms in Go: Token Bucket, Exponential Backoff, and Circuit Breakers</title>
      <link>https://www.mfun.ink/en/2026/03/18/go-openai-429-5xx-storm-defense-token-bucket-backoff-circuit-breaker/</link>
      <pubDate>Wed, 18 Mar 2026 01:14:00 +0000</pubDate>
      <guid>https://www.mfun.ink/en/2026/03/18/go-openai-429-5xx-storm-defense-token-bucket-backoff-circuit-breaker/</guid>
      <description>&lt;p&gt;Most Go teams are not killed by a single API error. They are killed by a retry storm they created themselves.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
