If your production stack calls both Claude and OpenAI, the hard part is not API integration. The hard part is keeping user experience stable when one provider hits a 429/5xx spike, a regional latency regression, or a timeout storm.

This guide gives you a practical dual-provider gateway playbook: health probes, circuit breaking, SLA-aware fallback, and observability loops. The goal is not “never fail.” The goal is controlled failure with controlled cost and controlled latency.

TL;DR: Three defense layers

  • Layer 1: Active health probes (minute-level)
    • Track p95 latency, error rate, and 429 ratio per model
  • Layer 2: Circuit breaker + half-open probing (second-level)
    • Open on consecutive failures, then probe recovery in half-open mode
  • Layer 3: SLA fallback policy (request-level)
    • Route by tier: quality-first / cost-first / latency-first

Without these three, “multi-provider” usually means two invoices and one outage.

Architecture flow

  1. Client sends x-sla-tier (gold/silver/bronze) and x-intent (chat/code/summarize)
  2. Gateway picks a primary model (for example, claude-sonnet or gpt-4.1)
  3. Before sending, it checks circuit state and last-minute health score
  4. On failure, apply fallback path: lower model in same provider → switch provider
  5. Log every decision to metrics/events for audit and replay
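
Step 5 deserves to be concrete early, because it is what makes incidents auditable. A minimal sketch of a per-request decision event in Go (the field set is an assumption, not a fixed schema):

import "time"

// RouteDecision is logged once per request so routing choices can be
// audited and replayed after an incident.
type RouteDecision struct {
  RequestID   string    `json:"request_id"`
  Tier        string    `json:"tier"`         // from x-sla-tier
  Intent      string    `json:"intent"`       // from x-intent
  Primary     string    `json:"primary"`      // e.g. "claude/sonnet"
  Chosen      string    `json:"chosen"`       // target that actually served
  FallbackHop int       `json:"fallback_hop"` // 0 = primary handled it
  Reason      string    `json:"reason"`       // "healthy", "breaker_open", "low_health_score"
  At          time.Time `json:"at"`
}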

1) Health probes: do not stop at “is alive”

A simple /healthz=200 check is not enough. You need serviceability health: whether the provider can actually serve your traffic, at acceptable latency, right now.

Suggested metrics

  • provider_request_success_ratio_1m
  • provider_p95_latency_ms_1m
  • provider_429_ratio_1m
  • provider_5xx_ratio_1m
  • provider_timeout_ratio_1m

Example score formula

health_score = 100
  - (p95_latency_ms / 1000) * 8
  - timeout_ratio * 60
  - ratio_5xx * 80
  - ratio_429 * 40

if health_score < 65 => unhealthy
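
The same formula, translated directly to Go for the gateway's scoring loop (function and parameter names are illustrative; ratios are in [0,1], latency in milliseconds):

const unhealthyBelow = 65.0 // health_score < 65 => unhealthy

func healthScore(p95LatencyMs, timeoutRatio, ratio5xx, ratio429 float64) float64 {
  return 100 -
    (p95LatencyMs/1000)*8 -
    timeoutRatio*60 -
    ratio5xx*80 -
    ratio429*40
}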

2) Circuit breaking and failover: prefer graceful degradation over cascades

Starter circuit breaker settings

  • Consecutive failure threshold: 5
  • Open duration: 30s
  • Half-open probe requests: 3
  • Half-open success threshold: 2/3

Go-style pseudocode

type CircuitState string

const (
  Closed   CircuitState = "closed"
  Open     CircuitState = "open"
  HalfOpen CircuitState = "half_open"
)

// Per-target state: breaker is the circuit breaker sketched below,
// health is the last-minute score from section 1.
var (
  breaker map[Target]*Breaker
  health  map[Target]float64
)

func Route(req Request) Target {
  candidates := rankedTargets(req) // rank by intent + quality + cost
  for _, t := range candidates {
    // Skip any target whose breaker refuses traffic or whose
    // health score sits below the unhealthy threshold (65).
    if breaker[t].Allow() && health[t] >= 65 {
      return t
    }
  }
  // Every candidate is unhealthy: degrade gracefully instead of erroring.
  return emergencyFallback(req)
}
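
Allow() above hides the actual state machine. Here is a minimal sketch of one, hard-wired to the starter settings (trip after 5 consecutive failures, stay open 30s, admit 3 half-open probes, close on 2/3 successes); Breaker and Report are illustrative names, not a specific library:

import (
  "sync"
  "time"
)

type Breaker struct {
  mu           sync.Mutex
  state        CircuitState
  failures     int       // consecutive failures while closed
  openedAt     time.Time // when the breaker last opened
  probes       int       // probe requests admitted in half-open
  probeSuccess int       // probe successes in half-open
}

// NewBreaker starts closed; the zero value would start in an unnamed state.
func NewBreaker() *Breaker { return &Breaker{state: Closed} }

func (b *Breaker) Allow() bool {
  b.mu.Lock()
  defer b.mu.Unlock()
  if b.state == Open {
    if time.Since(b.openedAt) < 30*time.Second {
      return false
    }
    // Open window elapsed: move to half-open and start probing.
    b.state, b.probes, b.probeSuccess = HalfOpen, 0, 0
  }
  if b.state == HalfOpen {
    if b.probes >= 3 { // admit at most 3 probes
      return false
    }
    b.probes++
  }
  return true
}

// Report feeds each request outcome back into the breaker.
func (b *Breaker) Report(ok bool) {
  b.mu.Lock()
  defer b.mu.Unlock()
  switch b.state {
  case Closed:
    if ok {
      b.failures = 0
      return
    }
    if b.failures++; b.failures >= 5 {
      b.state, b.openedAt = Open, time.Now() // trip the breaker
    }
  case HalfOpen:
    if !ok {
      b.state, b.openedAt = Open, time.Now() // any probe failure re-opens
      return
    }
    if b.probeSuccess++; b.probeSuccess >= 2 {
      b.state, b.failures = Closed, 0 // enough successes: close again
    }
  }
}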

3) SLA fallback policy: define it up front

Tiering suggestion

  • gold (critical requests)
    • cross-provider failover enabled + one retry + high-quality model floor
  • silver (default requests)
    • same-provider downgrade first, then one cross-provider switch
  • bronze (cost-sensitive requests)
    • no expensive fallback; return degraded result on budget cap

YAML policy example

sla:
  gold:
    max_retries: 1
    allow_cross_provider_failover: true
    quality_floor: high
  silver:
    max_retries: 1
    allow_cross_provider_failover: true
    quality_floor: medium
  bronze:
    max_retries: 0
    allow_cross_provider_failover: false
    quality_floor: basic
    token_budget_per_req: 4000
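
If the gateway is in Go like the pseudocode above, the policy file can unmarshal into structs along these lines (a sketch using gopkg.in/yaml.v3; the field names simply mirror the YAML keys):

import "gopkg.in/yaml.v3"

type TierPolicy struct {
  MaxRetries                 int    `yaml:"max_retries"`
  AllowCrossProviderFailover bool   `yaml:"allow_cross_provider_failover"`
  QualityFloor               string `yaml:"quality_floor"`
  TokenBudgetPerReq          int    `yaml:"token_budget_per_req"` // 0 = uncapped
}

type Policy struct {
  SLA map[string]TierPolicy `yaml:"sla"` // keyed by gold/silver/bronze
}

func loadPolicy(raw []byte) (*Policy, error) {
  var p Policy
  if err := yaml.Unmarshal(raw, &p); err != nil {
    return nil, err
  }
  return &p, nil
}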

4) MVP command checklist

Start gateway with hot-reload policy

export ROUTER_CONFIG=/etc/llm-router/policy.yaml
export ROUTER_REFRESH_SEC=15
./llm-gateway --listen :8080
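
Inside the gateway, the refresh loop implied by ROUTER_REFRESH_SEC might look like this sketch, reusing the loadPolicy helper from the previous section and an atomic.Value for lock-free policy swaps (both assumptions about internals):

import (
  "os"
  "strconv"
  "sync/atomic"
  "time"
)

var activePolicy atomic.Value // holds *Policy

func startPolicyReload() {
  path := os.Getenv("ROUTER_CONFIG")
  sec, err := strconv.Atoi(os.Getenv("ROUTER_REFRESH_SEC"))
  if err != nil || sec <= 0 {
    sec = 15 // default matches the example above
  }
  go func() {
    for range time.Tick(time.Duration(sec) * time.Second) {
      raw, err := os.ReadFile(path)
      if err != nil {
        continue // keep serving the last good policy
      }
      if p, err := loadPolicy(raw); err == nil {
        activePolicy.Store(p) // swap in place, no restart needed
      }
    }
  }()
}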

Load test to exercise fallback behavior

hey -z 30s -c 20 -m POST \
  -H 'x-sla-tier: silver' \
  -H 'content-type: application/json' \
  -d '{"intent":"chat","prompt":"explain circuit breaker"}' \
  http://127.0.0.1:8080/v1/respond

Trigger a controlled breaker drill

curl -X POST http://127.0.0.1:8080/admin/breakers/open \
  -H 'content-type: application/json' \
  -d '{"provider":"claude","model":"sonnet"}'

Common failures and fixes

Failure 1: route flapping

Symptom: traffic bounces rapidly across providers; latency gets worse.
Root cause: no minimum dwell time.
Fix: add a minimum dwell time (hold_for: 20s); only a severe failure on the current target may interrupt it.
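
The dwell check itself is a few lines in the router; holdFor mirrors the hold_for: 20s setting, and severe is whatever you classify as a hard failure on the current target (both are naming assumptions):

import "time"

// canSwitch gates target changes: ordinary health-score dips must wait
// out the dwell window; only a severe failure may interrupt it.
func canSwitch(lastSwitch time.Time, severe bool) bool {
  const holdFor = 20 * time.Second
  return severe || time.Since(lastSwitch) >= holdFor
}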

Failure 2: immediate re-failure after half-open

Symptom: service fails again as soon as half-open starts.
Root cause: too much probe traffic.
Fix: limit half-open concurrency and use a tighter timeout budget (for example, 2s).
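
Concretely, half-open probes can run under a tighter context deadline than normal traffic; a sketch with a plain net/http client (probeOnce is an illustrative name):

import (
  "context"
  "fmt"
  "net/http"
  "time"
)

// probeOnce sends one half-open probe with a 2s budget, so a provider
// that is still sick fails fast instead of tying up gateway capacity.
func probeOnce(client *http.Client, req *http.Request) error {
  ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
  defer cancel()
  resp, err := client.Do(req.WithContext(ctx))
  if err != nil {
    return err
  }
  defer resp.Body.Close()
  if resp.StatusCode >= 500 {
    return fmt.Errorf("probe returned %d", resp.StatusCode)
  }
  return nil
}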

Failure 3: cost explosion during incidents

Symptom: failover shifts too much traffic to expensive models.
Root cause: no degraded-mode budget guardrail.
Fix: enforce degraded_mode_cost_cap; once over the cap, force traffic down the silver/bronze path.
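
The guardrail can be a single pre-routing check; degraded_mode_cost_cap comes from the fix above, while spentUSD and the tier clamp are hypothetical:

// applyCostGuardrail clamps gold traffic to the silver path once spend
// in the current window passes the degraded-mode cap.
func applyCostGuardrail(tier string, degraded bool, spentUSD, capUSD float64) string {
  if degraded && spentUSD > capUSD && tier == "gold" {
    return "silver"
  }
  return tier
}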

Observability and alerts

Keep three dashboards first:

  • Availability: success/error ratio by provider/model
  • Experience: p50/p95 latency by SLA tier
  • Cost: token usage and cost per 1k requests

Start with two high-signal alerts:

  • gold tier success_ratio_5m < 99%
  • failover_rate_5m > 15%

Summary

Dual-provider routing is not “two API keys.” It is a reliability system. Implement health scoring, circuit breaking, and SLA fallback first, then optimize with A/B routing, semantic caching, and adaptive budgets.

If you need a production-safe starting point this week, ship this MVP:

  1. 1-minute health scoring
  2. threshold-based circuit breaker
  3. gold/silver/bronze fallback policy

That alone removes most 2am incidents.