If your production stack calls both Claude and OpenAI, the hard part is not API integration. The hard part is keeping user experience stable when one provider hits a 429/5xx spike, a regional latency regression, or a timeout storm.

This guide gives you a practical dual-provider gateway playbook: health probes, circuit breaking, SLA-aware fallback, and observability loops. The goal is not “never fail.” The goal is controlled failure with controlled cost and controlled latency.

TL;DR: Three defense layers

  • Layer 1: Active health probes (minute-level)
    • Track p95 latency, error rate, and 429 ratio per model
  • Layer 2: Circuit breaker + half-open probing (second-level)
    • Open on consecutive failures, then probe recovery in half-open mode
  • Layer 3: SLA fallback policy (request-level)
    • Route by tier: quality-first / cost-first / latency-first

Without these three, “multi-provider” usually means two invoices and one outage.

Architecture flow

  1. Client sends x-sla-tier (gold/silver/bronze) and x-intent (chat/code/summarize)
  2. Gateway picks a primary model (for example, claude-sonnet or gpt-4.1)
  3. Before sending, it checks circuit state and last-minute health score
  4. On failure, apply fallback path: lower model in same provider → switch provider
  5. Log every decision to metrics/events for audit and replay
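
Step 5 deserves to be concrete early, because it is what makes incidents auditable. A minimal sketch of a per-request decision event in Go (the field set is an assumption, not a fixed schema):

import "time"

// RouteDecision is logged once per request so routing choices can be
// audited and replayed after an incident.
type RouteDecision struct {
  RequestID   string    `json:"request_id"`
  Tier        string    `json:"tier"`         // from x-sla-tier
  Intent      string    `json:"intent"`       // from x-intent
  Primary     string    `json:"primary"`      // e.g. "claude/sonnet"
  Chosen      string    `json:"chosen"`       // target that actually served
  FallbackHop int       `json:"fallback_hop"` // 0 = primary handled it
  Reason      string    `json:"reason"`       // "healthy", "breaker_open", "low_health_score"
  At          time.Time `json:"at"`
}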

1) Health probes: do not stop at “is alive”

A simple /healthz=200 check is not enough. You need serviceability health: whether the provider can actually serve your traffic, at acceptable latency, right now.

Suggested metrics

  • provider_request_success_ratio_1m
  • provider_p95_latency_ms_1m
  • provider_429_ratio_1m
  • provider_5xx_ratio_1m
  • provider_timeout_ratio_1m

Example score formula

health_score = 100
  - (p95_latency_ms / 1000) * 8
  - timeout_ratio * 60
  - ratio_5xx * 80
  - ratio_429 * 40

if health_score < 65 => unhealthy
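
The same formula, translated directly to Go for the gateway's scoring loop (function and parameter names are illustrative; ratios are in [0,1], latency in milliseconds):

const unhealthyBelow = 65.0 // health_score < 65 => unhealthy

func healthScore(p95LatencyMs, timeoutRatio, ratio5xx, ratio429 float64) float64 {
  return 100 -
    (p95LatencyMs/1000)*8 -
    timeoutRatio*60 -
    ratio5xx*80 -
    ratio429*40
}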

2) Circuit breaking and failover: prefer graceful degradation over cascades

Starter circuit breaker settings

  • Consecutive failure threshold: 5
  • Open duration: 30s
  • Half-open probe requests: 3
  • Half-open success threshold: 2/3

Go-style pseudocode

type CircuitState string

const (
  Closed   CircuitState = "closed"
  Open     CircuitState = "open"
  HalfOpen CircuitState = "half_open"
)

// Per-target state: breaker is the circuit breaker sketched below,
// health is the last-minute score from section 1.
var (
  breaker map[Target]*Breaker
  health  map[Target]float64
)

func Route(req Request) Target {
  candidates := rankedTargets(req) // rank by intent + quality + cost
  for _, t := range candidates {
    // Skip any target whose breaker refuses traffic or whose
    // health score sits below the unhealthy threshold (65).
    if breaker[t].Allow() && health[t] >= 65 {
      return t
    }
  }
  // Every candidate is unhealthy: degrade gracefully instead of erroring.
  return emergencyFallback(req)
}
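
Allow() above hides the actual state machine. Here is a minimal sketch of one, hard-wired to the starter settings (trip after 5 consecutive failures, stay open 30s, admit 3 half-open probes, close on 2/3 successes); Breaker and Report are illustrative names, not a specific library:

import (
  "sync"
  "time"
)

type Breaker struct {
  mu           sync.Mutex
  state        CircuitState
  failures     int       // consecutive failures while closed
  openedAt     time.Time // when the breaker last opened
  probes       int       // probe requests admitted in half-open
  probeSuccess int       // probe successes in half-open
}

// NewBreaker starts closed; the zero value would start in an unnamed state.
func NewBreaker() *Breaker { return &Breaker{state: Closed} }

func (b *Breaker) Allow() bool {
  b.mu.Lock()
  defer b.mu.Unlock()
  if b.state == Open {
    if time.Since(b.openedAt) < 30*time.Second {
      return false
    }
    // Open window elapsed: move to half-open and start probing.
    b.state, b.probes, b.probeSuccess = HalfOpen, 0, 0
  }
  if b.state == HalfOpen {
    if b.probes >= 3 { // admit at most 3 probes
      return false
    }
    b.probes++
  }
  return true
}

// Report feeds each request outcome back into the breaker.
func (b *Breaker) Report(ok bool) {
  b.mu.Lock()
  defer b.mu.Unlock()
  switch b.state {
  case Closed:
    if ok {
      b.failures = 0
      return
    }
    if b.failures++; b.failures >= 5 {
      b.state, b.openedAt = Open, time.Now() // trip the breaker
    }
  case HalfOpen:
    if !ok {
      b.state, b.openedAt = Open, time.Now() // any probe failure re-opens
      return
    }
    if b.probeSuccess++; b.probeSuccess >= 2 {
      b.state, b.failures = Closed, 0 // enough successes: close again
    }
  }
}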

3) SLA fallback policy: define it up front

Tiering suggestion

  • gold (critical requests)
    • cross-provider failover enabled + one retry + high-quality model floor
  • silver (default requests)
    • same-provider downgrade first, then one cross-provider switch
  • bronze (cost-sensitive requests)
    • no expensive fallback; return degraded result on budget cap

YAML policy example

sla:
  gold:
    max_retries: 1
    allow_cross_provider_failover: true
    quality_floor: high
  silver:
    max_retries: 1
    allow_cross_provider_failover: true
    quality_floor: medium
  bronze:
    max_retries: 0
    allow_cross_provider_failover: false
    quality_floor: basic
    token_budget_per_req: 4000
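
If the gateway is in Go like the pseudocode above, the policy file can unmarshal into structs along these lines (a sketch using gopkg.in/yaml.v3; the field names simply mirror the YAML keys):

import "gopkg.in/yaml.v3"

type TierPolicy struct {
  MaxRetries                 int    `yaml:"max_retries"`
  AllowCrossProviderFailover bool   `yaml:"allow_cross_provider_failover"`
  QualityFloor               string `yaml:"quality_floor"`
  TokenBudgetPerReq          int    `yaml:"token_budget_per_req"` // 0 = uncapped
}

type Policy struct {
  SLA map[string]TierPolicy `yaml:"sla"` // keyed by gold/silver/bronze
}

func loadPolicy(raw []byte) (*Policy, error) {
  var p Policy
  if err := yaml.Unmarshal(raw, &p); err != nil {
    return nil, err
  }
  return &p, nil
}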

4) MVP command checklist

Start gateway with hot-reload policy

export ROUTER_CONFIG=/etc/llm-router/policy.yaml
export ROUTER_REFRESH_SEC=15
./llm-gateway --listen :8080
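
Inside the gateway, the refresh loop implied by ROUTER_REFRESH_SEC might look like this sketch, reusing the loadPolicy helper from the previous section and an atomic.Value for lock-free policy swaps (both assumptions about internals):

import (
  "os"
  "strconv"
  "sync/atomic"
  "time"
)

var activePolicy atomic.Value // holds *Policy

func startPolicyReload() {
  path := os.Getenv("ROUTER_CONFIG")
  sec, err := strconv.Atoi(os.Getenv("ROUTER_REFRESH_SEC"))
  if err != nil || sec <= 0 {
    sec = 15 // default matches the example above
  }
  go func() {
    for range time.Tick(time.Duration(sec) * time.Second) {
      raw, err := os.ReadFile(path)
      if err != nil {
        continue // keep serving the last good policy
      }
      if p, err := loadPolicy(raw); err == nil {
        activePolicy.Store(p) // swap in place, no restart needed
      }
    }
  }()
}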

Load test to exercise fallback behavior

hey -z 30s -c 20 -m POST \
  -H 'x-sla-tier: silver' \
  -H 'content-type: application/json' \
  -d '{"intent":"chat","prompt":"explain circuit breaker"}' \
  http://127.0.0.1:8080/v1/respond

Trigger a controlled breaker drill

curl -X POST http://127.0.0.1:8080/admin/breakers/open \
  -H 'content-type: application/json' \
  -d '{"provider":"claude","model":"sonnet"}'

Common failures and fixes

Failure 1: route flapping

Symptom: traffic bounces rapidly across providers; latency gets worse.
Root cause: no minimum dwell time.
Fix: add a minimum dwell time (hold_for: 20s); only a severe failure on the current target may interrupt it.
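
The dwell check itself is a few lines in the router; holdFor mirrors the hold_for: 20s setting, and severe is whatever you classify as a hard failure on the current target (both are naming assumptions):

import "time"

// canSwitch gates target changes: ordinary health-score dips must wait
// out the dwell window; only a severe failure may interrupt it.
func canSwitch(lastSwitch time.Time, severe bool) bool {
  const holdFor = 20 * time.Second
  return severe || time.Since(lastSwitch) >= holdFor
}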

Failure 2: immediate re-failure after half-open

Symptom: service fails again as soon as half-open starts.
Root cause: too much probe traffic.
Fix: limit half-open concurrency and use a tighter timeout budget (for example, 2s).
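
Concretely, half-open probes can run under a tighter context deadline than normal traffic; a sketch with a plain net/http client (probeOnce is an illustrative name):

import (
  "context"
  "fmt"
  "net/http"
  "time"
)

// probeOnce sends one half-open probe with a 2s budget, so a provider
// that is still sick fails fast instead of tying up gateway capacity.
func probeOnce(client *http.Client, req *http.Request) error {
  ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
  defer cancel()
  resp, err := client.Do(req.WithContext(ctx))
  if err != nil {
    return err
  }
  defer resp.Body.Close()
  if resp.StatusCode >= 500 {
    return fmt.Errorf("probe returned %d", resp.StatusCode)
  }
  return nil
}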

Failure 3: cost explosion during incidents

Symptom: failover shifts too much traffic to expensive models.
Root cause: no degraded-mode budget guardrail.
Fix: enforce degraded_mode_cost_cap; once over the cap, force traffic down the silver/bronze path.
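
The guardrail can be a single pre-routing check; degraded_mode_cost_cap comes from the fix above, while spentUSD and the tier clamp are hypothetical:

// applyCostGuardrail clamps gold traffic to the silver path once spend
// in the current window passes the degraded-mode cap.
func applyCostGuardrail(tier string, degraded bool, spentUSD, capUSD float64) string {
  if degraded && spentUSD > capUSD && tier == "gold" {
    return "silver"
  }
  return tier
}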

Observability and alerts

Keep three dashboards first:

  • Availability: success/error ratio by provider/model
  • Experience: p50/p95 latency by SLA tier
  • Cost: token usage and cost per 1k requests

Start with two high-signal alerts:

  • gold tier success_ratio_5m < 99%
  • failover_rate_5m > 15%

Summary

Dual-provider routing is not “two API keys.” It is a reliability system. Implement health scoring, circuit breaking, and SLA fallback first, then optimize with A/B routing, semantic caching, and adaptive budgets.

If you need a production-safe starting point this week, ship this MVP:

  1. 1-minute health scoring
  2. threshold-based circuit breaker
  3. gold/silver/bronze fallback policy

That alone removes most 2am incidents.