Running both Claude and OpenAI in production sounds resilient—until a slow failure hits: latency climbs, 429s spike, quality drifts, and everything still looks “up.”
This guide gives you a practical dual-stack degradation runbook: timeout probing first, circuit-based cutover second, and an error-budget dashboard to keep business impact bounded.
TL;DR
- Put all traffic behind one provider gateway and tag every request with provider/model/tenant.
- Timeout before retry. Per-request timeout should stay within 60% of the user-facing SLA.
- Use a 3-phase cutover: soft shift → hard cut → canary return.
- Manage incidents with a rolling 30-day error budget, not pure intuition.
1) Timeout probing: detect “slow” before “down”
Split timeout telemetry into `connect_timeout`, `ttfb_timeout`, and `total_timeout`. Watching only the total timeout means you learn about degradation too late, after the whole request budget is already spent.
```shell
# provider probe every 30s
curl -sS http://gateway.internal/health/providers | jq '.providers[] | {name,p95_latency,error_rate,timeout_rate}'

# 15m timeout ratio by provider
curl -G 'http://prometheus.internal/api/v1/query' --data-urlencode 'query=sum(rate(llm_requests_total{status="timeout"}[15m])) by (provider) / sum(rate(llm_requests_total[15m])) by (provider)'
```
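The 60% rule from the TL;DR can be turned into a concrete per-request timeout budget. A minimal sketch: the 60%-of-SLA total is from this guide, while the 10% connect share and 50% TTFB share are illustrative splits you should tune to your own traffic.

```python
def timeout_budget(sla_seconds: float) -> dict:
    """Derive connect/ttfb/total timeouts from a user-facing SLA.

    total_timeout follows the 60%-of-SLA rule; the connect and
    ttfb fractions below are assumed starting points, not gospel.
    """
    total = sla_seconds * 0.6          # leave 40% of SLA for retries/queueing
    connect = min(2.0, total * 0.1)    # fail fast on TCP/TLS setup
    ttfb = total * 0.5                 # model must start streaming by here
    return {"connect": connect, "ttfb": ttfb, "total": total}
```

For a 10-second SLA this yields a 6-second total timeout, so one retry still fits inside the SLA.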
Suggested alert thresholds:
- `timeout_rate > 3%` for 5 minutes: enter soft-degrade watch mode
- `timeout_rate > 8%` or `p95 > SLA x 1.8`: candidate for hard cut
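Those thresholds map cleanly onto a small decision function. A sketch under the assumption that the "for 5 minutes" duration is enforced by your alerting layer (e.g. a Prometheus `for:` clause), so this function only sees already-sustained values:

```python
def degrade_state(timeout_rate: float, p95: float, sla: float) -> str:
    """Map sustained metrics to a runbook state, using the
    3% / 8% / 1.8x-SLA thresholds from the alert table above."""
    if timeout_rate > 0.08 or p95 > sla * 1.8:
        return "hard-cut-candidate"
    if timeout_rate > 0.03:
        return "soft-degrade-watch"
    return "healthy"
```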
2) Circuit-based cutover in 3 phases
2.1 Soft shift
Route 20% of traffic to the backup provider and watch it for 10 minutes:
```yaml
routes:
  - tenant: default
    policy:
      primary: claude
      secondary: openai
      weights:
        claude: 80
        openai: 20
      failover:
        on_timeout: true
        on_5xx: true
        max_retries: 1
```
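If your gateway doesn't support weighted routing natively, the 80/20 split is easy to approximate in application code. A minimal sketch, assuming per-request random selection is acceptable (session-sticky hashing is the usual refinement):

```python
import random

def pick_provider(weights: dict, rng: random.Random) -> str:
    """Weighted provider pick mirroring the soft-shift config above."""
    providers = list(weights)
    return rng.choices(providers, weights=[weights[p] for p in providers], k=1)[0]

# e.g. pick_provider({"claude": 80, "openai": 20}, random.Random())
```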
2.2 Hard cut
If primary trips circuit conditions (for example, 5-minute error rate > 12%), fully switch and freeze rollback briefly:
```shell
gatewayctl circuit open --provider claude --reason 'timeout_storm_5m'
gatewayctl route set --tenant default --primary openai --secondary claude
```
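The trip-and-freeze behavior behind those commands can be sketched as a tiny state machine. This assumes the caller feeds it 5-minute error counts; the 12% threshold comes from the example above, and the 10-minute freeze window matches the rollback freeze in the MVP section:

```python
class Circuit:
    """Minimal circuit sketch: opens when the 5-minute error rate
    exceeds the threshold, then freezes rollback for a cooldown window."""

    def __init__(self, threshold: float = 0.12, freeze_s: float = 600.0):
        self.threshold = threshold
        self.freeze_s = freeze_s
        self.opened_at = None

    def observe(self, errors: int, total: int, now: float) -> str:
        if self.opened_at is not None:
            if now - self.opened_at < self.freeze_s:
                return "open"         # rollback frozen, stay on secondary
            self.opened_at = None     # freeze over: eligible for canary return
        if total > 0 and errors / total > self.threshold:
            self.opened_at = now
            return "open"
        return "closed"
```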
2.3 Canary return
Recover gradually: 10% → 30% → 50%, with at least 15 minutes per phase.
3) Error-budget dashboard: your operational brake line
For 99.5% monthly availability, the 30-day downtime budget is about 216 minutes. Split by tenant tier:
- Tier-1 paid tenants: if budget usage > 70%, disable expensive fallback paths.
- Tier-2 standard tenants: if usage > 85%, enforce short context + single retry.
- Internal tooling: throttle first when budget is burned.
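The budget arithmetic and tier policies above are simple enough to encode directly. A sketch with illustrative tier names and action labels, not a real policy engine:

```python
def budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime budget for an availability SLO over a rolling window.
    For slo=0.995 over 30 days this is ~216 minutes."""
    return (1 - slo) * window_days * 24 * 60

def tier_action(tier: str, usage: float) -> str:
    """Map budget usage to the tier policies above (names illustrative)."""
    if tier == "tier1" and usage > 0.70:
        return "disable-expensive-fallbacks"
    if tier == "tier2" and usage > 0.85:
        return "short-context-single-retry"
    if tier == "internal" and usage >= 1.0:
        return "throttle"
    return "normal"
```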
```sql
SELECT
  provider,
  tenant,
  SUM(error_requests) / NULLIF(SUM(total_requests), 0) AS error_rate,
  SUM(timeout_requests) / NULLIF(SUM(total_requests), 0) AS timeout_rate
FROM llm_sli_5m
WHERE ts >= now() - interval '30 days'
GROUP BY provider, tenant;
```
4) Common failure modes
- Cost spikes after failover: the backup model has a larger default context and retries were left open. Cap `max_tokens` at 70% of baseline and retries at 1.
- Both providers are slow: this is request-shape debt, not routing debt. Split long prompts, enable semantic caching, and move non-realtime jobs to batch.
- Circuit flapping: threshold too sensitive. Add cooldown (8–15 min) and minimum sample size.
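The flapping fix (cooldown plus minimum sample size) fits in one guard function. A sketch using 900s (15 min, the top of the 8-15 min range above) and an assumed minimum of 200 samples per window:

```python
def should_trip(errors: int, total: int, threshold: float = 0.12,
                min_samples: int = 200, last_trip_s: float = None,
                now_s: float = 0.0, cooldown_s: float = 900.0) -> bool:
    """Anti-flap guard: never trip on a thin sample, and never
    re-trip inside the cooldown window after the last trip."""
    if total < min_samples:
        return False                       # not enough evidence yet
    if last_trip_s is not None and now_s - last_trip_s < cooldown_s:
        return False                       # still cooling down
    return errors / total > threshold
```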
5) MVP you can ship today
- Add three gateway metrics: timeout rate, p95 latency, and 5xx rate.
- Turn on soft shift (80/20).
- Add hard-cut command + 10-minute rollback freeze.
- Build and alert on a 30-day error-budget panel.
Start with observability + soft shift. That alone usually stabilizes your incident curve fast.