Running both Claude and OpenAI in production sounds resilient—until a slow failure hits: latency climbs, 429s spike, quality drifts, and everything still looks “up.”

This guide gives you a practical dual-stack degradation runbook: timeout probing first, circuit-based cutover second, and an error-budget dashboard to keep business impact bounded.

TL;DR

  • Put all traffic behind one provider gateway and tag every request with provider/model/tenant.
  • Timeout before retry. Keep the per-request timeout within 60% of the user-facing SLA.
  • Use a 3-phase cutover: soft shift → hard cut → canary return.
  • Manage incidents with a rolling 30-day error budget, not pure intuition.

1) Timeout probing: detect “slow” before “down”

Split timeout telemetry into connect_timeout, ttfb_timeout, and total_timeout. If you only watch the total timeout, you detect a degradation after users already feel it.
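
A minimal sketch of recording the three phases separately, assuming a requests-based client and a hypothetical emit() metrics helper:

import time

import requests

CONNECT_TIMEOUT = 3    # seconds allowed for TCP/TLS connect
READ_TIMEOUT = 20      # seconds of read inactivity; also caps time to first byte

def probe(provider: str, url: str, payload: dict, emit) -> None:
    # Time connect/TTFB and total duration separately; emit() is a
    # hypothetical metrics hook (statsd, Prometheus client, etc.).
    start = time.monotonic()
    try:
        # stream=True returns once response headers arrive, a usable TTFB proxy
        resp = requests.post(url, json=payload, stream=True,
                             timeout=(CONNECT_TIMEOUT, READ_TIMEOUT))
        ttfb = time.monotonic() - start
        resp.content  # drain the body so `total` covers the full response
        total = time.monotonic() - start
        emit(provider, ttfb=ttfb, total=total, status=resp.status_code)
    except requests.exceptions.ConnectTimeout:
        emit(provider, phase="connect", status="timeout")
    except requests.exceptions.ReadTimeout:
        emit(provider, phase="ttfb_or_read", status="timeout")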

# provider probe every 30s
curl -sS http://gateway.internal/health/providers | jq '.providers[] | {name,p95_latency,error_rate,timeout_rate}'

# 15m timeout ratio by provider
curl -G 'http://prometheus.internal/api/v1/query' \
  --data-urlencode 'query=sum(rate(llm_requests_total{status="timeout"}[15m])) by (provider) / sum(rate(llm_requests_total[15m])) by (provider)'

Suggested alert thresholds:

  • timeout_rate > 3% for 5 minutes: enter soft-degrade watch mode
  • timeout_rate > 8% or p95 > SLA x 1.8: candidate for hard cut
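
A sketch of the same thresholds as in-process gateway logic; SLA_SECONDS and the state names are assumptions, and callers should require each condition to persist (5 minutes for the watch state) before acting:

SLA_SECONDS = 10.0   # assumed user-facing latency SLA

def degrade_state(timeout_rate: float, p95_latency_s: float) -> str:
    # timeout_rate is a fraction (0.03 == 3%); the caller is responsible for
    # requiring the condition to hold for the durations listed above.
    if timeout_rate > 0.08 or p95_latency_s > SLA_SECONDS * 1.8:
        return "hard_cut_candidate"
    if timeout_rate > 0.03:
        return "soft_degrade_watch"
    return "healthy"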

2) Circuit-based cutover in 3 phases

2.1 Soft shift

Route 20% of traffic to the backup provider and watch for 10 minutes:

routes:
  - tenant: default
    policy:
      primary: claude
      secondary: openai
      weights:
        claude: 80
        openai: 20
      failover:
        on_timeout: true
        on_5xx: true
        max_retries: 1
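
A sketch of what that failover block means per request: time out first, then make at most one retry against the secondary. primary_call, secondary_call, and ServerError are stand-ins for the real provider clients:

class ServerError(Exception):
    """Stand-in for the provider client's 5xx exception."""

def call_with_failover(primary_call, secondary_call, timeout_s: float):
    # Timeout before retry: one attempt on the primary, at most one retry
    # on the secondary (mirrors max_retries: 1 / on_timeout / on_5xx above).
    try:
        return primary_call(timeout=timeout_s)
    except (TimeoutError, ServerError):
        return secondary_call(timeout=timeout_s)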

2.2 Hard cut

If the primary trips the circuit conditions (for example, 5-minute error rate > 12%), switch fully and freeze rollback for a short window:

gatewayctl circuit open --provider claude --reason 'timeout_storm_5m'
gatewayctl route set --tenant default --primary openai --secondary claude
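
A sketch of the trip-and-freeze logic behind those commands, assuming a 5-minute rolling error rate is available and shelling out to the same gatewayctl CLI:

import subprocess
import time

ERROR_RATE_TRIP = 0.12          # 5-minute error rate that opens the circuit
ROLLBACK_FREEZE_SECONDS = 600   # after a hard cut, refuse to roll back for 10 minutes

_freeze_until = 0.0

def hard_cut_if_tripped(error_rate_5m: float) -> bool:
    # Open the circuit and swap primary/secondary when the trip condition holds.
    global _freeze_until
    if error_rate_5m <= ERROR_RATE_TRIP:
        return False
    subprocess.run(["gatewayctl", "circuit", "open",
                    "--provider", "claude", "--reason", "error_rate_5m"], check=True)
    subprocess.run(["gatewayctl", "route", "set", "--tenant", "default",
                    "--primary", "openai", "--secondary", "claude"], check=True)
    _freeze_until = time.time() + ROLLBACK_FREEZE_SECONDS
    return True

def rollback_allowed() -> bool:
    # The canary return (next section) should call this before shifting traffic back.
    return time.time() >= _freeze_until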

2.3 Canary return

Return traffic to the primary gradually: 10% → 30% → 50%, holding each step for at least 15 minutes.
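
A sketch of the ramp, run only after the rollback freeze has expired; set_weights and primary_is_healthy are hypothetical hooks into the gateway:

import time

CANARY_STEPS = [10, 30, 50]   # % of traffic returned to the recovered primary
STEP_SECONDS = 15 * 60        # hold each step for at least 15 minutes

def canary_return(set_weights, primary_is_healthy) -> bool:
    # Ramp traffic back step by step; bail out to full failover on any bad reading.
    for pct in CANARY_STEPS:
        set_weights(primary=pct, secondary=100 - pct)
        time.sleep(STEP_SECONDS)
        if not primary_is_healthy():
            set_weights(primary=0, secondary=100)
            return False
    set_weights(primary=80, secondary=20)   # restore the normal 80/20 split
    return True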

3) Error-budget dashboard: your operational brake line

For 99.5% monthly availability, the 30-day downtime budget is about 216 minutes. Split by tenant tier:

  • Tier-1 paid tenants: if budget usage > 70%, disable expensive fallback paths.
  • Tier-2 standard tenants: if usage > 85%, enforce short context + single retry.
  • Internal tooling: throttle internal traffic first as the budget burns.

Track burn with a 30-day rollup over the 5-minute SLI table:

SELECT
  provider,
  tenant,
  -- multiply by 1.0 to avoid integer division if the counters are integer columns
  1.0 * SUM(error_requests) / NULLIF(SUM(total_requests), 0) AS error_rate,
  1.0 * SUM(timeout_requests) / NULLIF(SUM(total_requests), 0) AS timeout_rate
FROM llm_sli_5m
WHERE ts >= now() - interval '30 days'
GROUP BY provider, tenant;
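
And a minimal sketch of turning budget burn into per-tier actions, using the 70% / 85% thresholds above (tier names, action strings, and the internal 50% threshold are assumptions):

def tier_policy(tier: str, budget_used_fraction: float) -> list[str]:
    # Map 30-day budget burn onto the degradation levers listed above.
    if tier == "tier1" and budget_used_fraction > 0.70:
        return ["disable_expensive_fallback_paths"]
    if tier == "tier2" and budget_used_fraction > 0.85:
        return ["force_short_context", "single_retry_only"]
    if tier == "internal" and budget_used_fraction > 0.50:
        return ["throttle"]   # assumed threshold: internal traffic gives way first
    return []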

4) Common failure modes

  • Cost spikes after failover: the backup model has a larger default context and retries are left uncapped. Cap max_tokens at 70% of baseline and retries at 1.
  • Both providers are slow: this is request-shape debt, not routing debt. Split long prompts, enable semantic cache, move non-realtime jobs to batch.
  • Circuit flapping: thresholds are too sensitive. Add a cooldown (8–15 min) and a minimum sample size before the breaker may change state.
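
A sketch of that anti-flapping guard; the constants are placeholders within the ranges given above:

import time

MIN_SAMPLES = 200            # don't judge a provider on a handful of requests
COOLDOWN_SECONDS = 10 * 60   # pick a value in the 8–15 min range and hold state that long

class FlapGuard:
    def __init__(self):
        self._last_change = 0.0

    def allow_transition(self, sample_count: int) -> bool:
        # Permit an open/close decision only with enough data and outside cooldown.
        if sample_count < MIN_SAMPLES:
            return False
        if time.time() - self._last_change < COOLDOWN_SECONDS:
            return False
        return True

    def record_transition(self) -> None:
        self._last_change = time.time()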

5) MVP you can ship today

  1. Add three gateway metrics: timeout rate, p95 latency, and 5xx rate.
  2. Turn on soft shift (80/20).
  3. Add hard-cut command + 10-minute rollback freeze.
  4. Build and alert on a 30-day error-budget panel.

Start with observability + soft shift. That alone usually stabilizes your incident curve fast.