Running both Claude and OpenAI in production sounds resilient—until a slow failure hits: latency climbs, 429s spike, quality drifts, and everything still looks “up.”
This guide gives you a practical dual-stack degradation runbook: timeout probing first, circuit-based cutover second, and an error-budget dashboard to keep business impact bounded.
TL;DR
- Put all traffic behind one provider gateway and tag every request with provider/model/tenant.
- Timeout before retry. Per-request timeout should stay within 60% of the user-facing SLA.
- Use a 3-phase cutover: soft shift → hard cut → canary return.
- Manage incidents with a rolling 30-day error budget, not pure intuition.
1) Timeout probing: detect “slow” before “down”
Split timeout telemetry into `connect_timeout`, `ttfb_timeout`, and `total_timeout`. Watching only the total timeout means you learn about degradation too late, after the whole request budget is already spent.
```shell
# provider probe every 30s
curl -sS http://gateway.internal/health/providers | jq '.providers[] | {name,p95_latency,error_rate,timeout_rate}'

# 15m timeout ratio by provider
curl -G 'http://prometheus.internal/api/v1/query' --data-urlencode 'query=sum(rate(llm_requests_total{status="timeout"}[15m])) by (provider) / sum(rate(llm_requests_total[15m])) by (provider)'
```
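The 60% rule from the TL;DR can be turned into a concrete per-request timeout budget. A minimal sketch: the 60%-of-SLA total is from this guide, while the 10% connect share and 50% TTFB share are illustrative splits you should tune to your own traffic.

```python
def timeout_budget(sla_seconds: float) -> dict:
    """Derive connect/ttfb/total timeouts from a user-facing SLA.

    total_timeout follows the 60%-of-SLA rule; the connect and
    ttfb fractions below are assumed starting points, not gospel.
    """
    total = sla_seconds * 0.6          # leave 40% of SLA for retries/queueing
    connect = min(2.0, total * 0.1)    # fail fast on TCP/TLS setup
    ttfb = total * 0.5                 # model must start streaming by here
    return {"connect": connect, "ttfb": ttfb, "total": total}
```

For a 10-second SLA this yields a 6-second total timeout, so one retry still fits inside the SLA.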
Suggested alert thresholds:
- `timeout_rate > 3%` for 5 minutes: enter soft-degrade watch mode
- `timeout_rate > 8%` or `p95 > SLA x 1.8`: candidate for hard cut
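Those thresholds map cleanly onto a small decision function. A sketch under the assumption that the "for 5 minutes" duration is enforced by your alerting layer (e.g. a Prometheus `for:` clause), so this function only sees already-sustained values:

```python
def degrade_state(timeout_rate: float, p95: float, sla: float) -> str:
    """Map sustained metrics to a runbook state, using the
    3% / 8% / 1.8x-SLA thresholds from the alert table above."""
    if timeout_rate > 0.08 or p95 > sla * 1.8:
        return "hard-cut-candidate"
    if timeout_rate > 0.03:
        return "soft-degrade-watch"
    return "healthy"
```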
2) Circuit-based cutover in 3 phases
2.1 Soft shift
Route 20% of traffic to the backup provider and watch it for 10 minutes:
```yaml
routes:
  - tenant: default
    policy:
      primary: claude
      secondary: openai
      weights:
        claude: 80
        openai: 20
      failover:
        on_timeout: true
        on_5xx: true
        max_retries: 1
```
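If your gateway doesn't support weighted routing natively, the 80/20 split is easy to approximate in application code. A minimal sketch, assuming per-request random selection is acceptable (session-sticky hashing is the usual refinement):

```python
import random

def pick_provider(weights: dict, rng: random.Random) -> str:
    """Weighted provider pick mirroring the soft-shift config above."""
    providers = list(weights)
    return rng.choices(providers, weights=[weights[p] for p in providers], k=1)[0]

# e.g. pick_provider({"claude": 80, "openai": 20}, random.Random())
```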
2.2 Hard cut
If primary trips circuit conditions (for example, 5-minute error rate > 12%), fully switch and freeze rollback briefly:
```shell
gatewayctl circuit open --provider claude --reason 'timeout_storm_5m'
gatewayctl route set --tenant default --primary openai --secondary claude
```
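The trip-and-freeze behavior behind those commands can be sketched as a tiny state machine. This assumes the caller feeds it 5-minute error counts; the 12% threshold comes from the example above, and the 10-minute freeze window matches the rollback freeze in the MVP section:

```python
class Circuit:
    """Minimal circuit sketch: opens when the 5-minute error rate
    exceeds the threshold, then freezes rollback for a cooldown window."""

    def __init__(self, threshold: float = 0.12, freeze_s: float = 600.0):
        self.threshold = threshold
        self.freeze_s = freeze_s
        self.opened_at = None

    def observe(self, errors: int, total: int, now: float) -> str:
        if self.opened_at is not None:
            if now - self.opened_at < self.freeze_s:
                return "open"         # rollback frozen, stay on secondary
            self.opened_at = None     # freeze over: eligible for canary return
        if total > 0 and errors / total > self.threshold:
            self.opened_at = now
            return "open"
        return "closed"
```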
2.3 Canary return
Recover gradually: 10% → 30% → 50%, with at least 15 minutes per phase.
3) Error-budget dashboard: your operational brake line
For 99.5% monthly availability, the 30-day downtime budget is about 216 minutes. Split by tenant tier:
- Tier-1 paid tenants: if budget usage > 70%, disable expensive fallback paths.
- Tier-2 standard tenants: if usage > 85%, enforce short context + single retry.
- Internal tooling: throttle first when budget is burned.
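The budget arithmetic and tier policies above are simple enough to encode directly. A sketch with illustrative tier names and action labels, not a real policy engine:

```python
def budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime budget for an availability SLO over a rolling window.
    For slo=0.995 over 30 days this is ~216 minutes."""
    return (1 - slo) * window_days * 24 * 60

def tier_action(tier: str, usage: float) -> str:
    """Map budget usage to the tier policies above (names illustrative)."""
    if tier == "tier1" and usage > 0.70:
        return "disable-expensive-fallbacks"
    if tier == "tier2" and usage > 0.85:
        return "short-context-single-retry"
    if tier == "internal" and usage >= 1.0:
        return "throttle"
    return "normal"
```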
```sql
SELECT
  provider,
  tenant,
  SUM(error_requests) / NULLIF(SUM(total_requests), 0) AS error_rate,
  SUM(timeout_requests) / NULLIF(SUM(total_requests), 0) AS timeout_rate
FROM llm_sli_5m
WHERE ts >= now() - interval '30 days'
GROUP BY provider, tenant;
```
4) Common failure modes
- Cost spikes after failover: the backup model has a larger default context and retries were left open. Cap `max_tokens` at 70% of baseline and retries at 1.
- Both providers are slow: this is request-shape debt, not routing debt. Split long prompts, enable semantic caching, and move non-realtime jobs to batch.
- Circuit flapping: threshold too sensitive. Add cooldown (8–15 min) and minimum sample size.
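The flapping fix (cooldown plus minimum sample size) fits in one guard function. A sketch using 900s (15 min, the top of the 8-15 min range above) and an assumed minimum of 200 samples per window:

```python
def should_trip(errors: int, total: int, threshold: float = 0.12,
                min_samples: int = 200, last_trip_s: float = None,
                now_s: float = 0.0, cooldown_s: float = 900.0) -> bool:
    """Anti-flap guard: never trip on a thin sample, and never
    re-trip inside the cooldown window after the last trip."""
    if total < min_samples:
        return False                       # not enough evidence yet
    if last_trip_s is not None and now_s - last_trip_s < cooldown_s:
        return False                       # still cooling down
    return errors / total > threshold
```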
5) MVP you can ship today
- Add three gateway metrics: timeout rate, p95 latency, and 5xx rate.
- Turn on soft shift (80/20).
- Add hard-cut command + 10-minute rollback freeze.
- Build and alert on a 30-day error-budget panel.
Start with observability + soft shift. That alone usually stabilizes your incident curve fast.