If your production stack calls both Claude and OpenAI, the hard part is not API integration. The hard part is keeping user experience stable when one provider starts throwing 429/5xx spikes, regional latency, or timeout storms.
This guide gives you a practical dual-provider gateway playbook: health probes, circuit breaking, SLA-aware fallback, and observability loops. The goal is not “never fail.” The goal is controlled failure with controlled cost and controlled latency.
TL;DR: Three defense layers
- Layer 1: Active health probes (minute-level)
- Track
p95 latency,error rate, and429 ratioper model
- Track
- Layer 2: Circuit breaker + half-open probing (second-level)
- Open on consecutive failures, then probe recovery in half-open mode
- Layer 3: SLA fallback policy (request-level)
- Route by tier: quality-first / cost-first / latency-first
Without these three, “multi-provider” usually means two invoices and one outage.
Architecture flow
- Client sends
x-sla-tier(gold/silver/bronze) andx-intent(chat/code/summarize) - Gateway picks a primary model (for example,
claude-sonnetorgpt-4.1) - Before sending, it checks circuit state and last-minute health score
- On failure, apply fallback path: lower model in same provider → switch provider
- Log every decision to metrics/events for audit and replay
1) Health probes: do not stop at “is alive”
A simple /healthz=200 check is not enough. You need serviceability health.
Suggested metrics
provider_request_success_ratio_1mprovider_p95_latency_ms_1mprovider_429_ratio_1mprovider_5xx_ratio_1mprovider_timeout_ratio_1m
Example score formula
health_score = 100
- (p95_latency_ms / 1000) * 8
- timeout_ratio * 60
- ratio_5xx * 80
- ratio_429 * 40
if health_score < 65 => unhealthy
2) Circuit breaking and failover: prefer graceful degradation over cascades
Starter circuit breaker settings
- Consecutive failure threshold:
5 - Open duration:
30s - Half-open probe requests:
3 - Half-open success threshold:
2/3
Go-style pseudocode
type CircuitState string
const (
Closed CircuitState = "closed"
Open CircuitState = "open"
HalfOpen CircuitState = "half_open"
)
func Route(req Request) Target {
candidates := rankedTargets(req) // rank by intent + quality + cost
for _, t := range candidates {
if breaker[t].Allow() && health[t] >= 65 {
return t
}
}
return emergencyFallback(req)
}
3) SLA fallback policy: define it up front
Tiering suggestion
- gold (critical requests)
- cross-provider failover enabled + one retry + high-quality model floor
- silver (default requests)
- same-provider downgrade first, then one cross-provider switch
- bronze (cost-sensitive requests)
- no expensive fallback; return degraded result on budget cap
YAML policy example
sla:
gold:
max_retries: 1
allow_cross_provider_failover: true
quality_floor: high
silver:
max_retries: 1
allow_cross_provider_failover: true
quality_floor: medium
bronze:
max_retries: 0
allow_cross_provider_failover: false
quality_floor: basic
token_budget_per_req: 4000
4) MVP command checklist
Start gateway with hot-reload policy
export ROUTER_CONFIG=/etc/llm-router/policy.yaml
export ROUTER_REFRESH_SEC=15
./llm-gateway --listen :8080
Load test with fallback behavior
hey -z 30s -c 20 -m POST \
-H 'x-sla-tier: silver' \
-H 'content-type: application/json' \
-d '{"intent":"chat","prompt":"explain circuit breaker"}' \
http://127.0.0.1:8080/v1/respond
Trigger a controlled breaker drill
curl -X POST http://127.0.0.1:8080/admin/breakers/open \
-d '{"provider":"claude","model":"sonnet"}'
Common failures and fixes
Failure 1: route flapping
Symptom: traffic bounces rapidly across providers; latency gets worse.
Root cause: no minimum dwell time.
Fix: add hold_for: 20s; only severe higher-priority failures can interrupt.
Failure 2: immediate re-failure after half-open
Symptom: service fails again as soon as half-open starts.
Root cause: too much probe traffic.
Fix: limit half-open concurrency and use a tighter timeout budget (for example, 2s).
Failure 3: cost explosion during incidents
Symptom: failover shifts too much traffic to expensive models.
Root cause: no degraded-mode budget guardrail.
Fix: enforce degraded_mode_cost_cap; force silver/bronze path over cap.
Observability and alerts
Keep three dashboards first:
- Availability: success/error ratio by provider/model
- Experience: p50/p95 latency by SLA tier
- Cost: token usage and cost per 1k requests
Start with two high-signal alerts:
gold tier success_ratio_5m < 99%failover_rate_5m > 15%
Summary
Dual-provider routing is not “two API keys.” It is a reliability system. Implement health scoring, circuit breaking, and SLA fallback first, then optimize with A/B routing, semantic caching, and adaptive budgets.
If you need a production-safe starting point this week, ship this MVP:
- 1-minute health scoring
- threshold-based circuit breaker
- gold/silver/bronze fallback policy
That alone removes most 2am incidents.