If your Go service relies on a single LLM provider, two failure modes hurt the most: timeout spikes and billing spikes. A real production setup is not just "add another provider"; it is a single control plane for routing, timeout tiers, cost caps, and fallback.
This guide walks through a practical OpenAI + Claude dual-provider pattern with one priority: keep uptime first, then optimize quality.
Target SLOs and constraints
- Success rate beats single-request latency
- Per-minute cost must have a hard cap
- Provider failures must trigger fast traffic shift
Routing model, classify then decide
Use a 3-layer flow.
- Admission layer (scene tags: chat, code, summary)
- Policy layer (SLO + budget based provider priority)
- Execution layer (timeout budget + retry policy)
import "time"

type Provider string

const (
	OpenAI Provider = "openai"
	Claude Provider = "claude"
)

type RouteInput struct {
	Scene      string        // admission-layer tag: chat, code, summary
	MaxLatency time.Duration // per-request latency budget
	MaxCostUSD float64       // per-request cost ceiling
}

// PickProvider is the policy layer: route cheap requests to the
// lower-cost provider, otherwise follow provider health.
func PickProvider(in RouteInput, health map[Provider]float64) Provider {
	if in.MaxCostUSD < 0.003 {
		return Claude
	}
	if health[OpenAI] >= 0.98 {
		return OpenAI
	}
	return Claude
}
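The health map fed into PickProvider has to come from somewhere. Below is a minimal sketch of a rolling success-rate tracker; HealthTracker, its unbounded counters, and the 1.0 no-data default are illustrative assumptions, not part of the routing code above (a production version would use a sliding window):

```go
package main

import (
	"fmt"
	"sync"
)

// HealthTracker (hypothetical name) accumulates success/total counts
// per provider from probe or request outcomes.
type HealthTracker struct {
	mu      sync.Mutex
	ok, all map[string]int
}

func NewHealthTracker() *HealthTracker {
	return &HealthTracker{ok: map[string]int{}, all: map[string]int{}}
}

// Record notes one outcome for a provider.
func (h *HealthTracker) Record(p string, success bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.all[p]++
	if success {
		h.ok[p]++
	}
}

// Rate returns the observed success rate, defaulting to 1.0 when
// no data has been recorded yet.
func (h *HealthTracker) Rate(p string) float64 {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.all[p] == 0 {
		return 1.0
	}
	return float64(h.ok[p]) / float64(h.all[p])
}

func main() {
	h := NewHealthTracker()
	for i := 0; i < 100; i++ {
		h.Record("openai", i%10 != 0) // 90% success
	}
	fmt.Printf("openai health: %.2f\n", h.Rate("openai"))
}
```

Feed `h.Rate(...)` values into the `health` map before each routing decision, or on a short probe interval.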
Timeout tiers, never one global timeout
Split timeout into connect, first-byte, and total budget.
// Requires net, net/http, and time.
client := &http.Client{
	Timeout: 12 * time.Second, // total budget: connect + TLS + first byte + body
	Transport: &http.Transport{
		DialContext:           (&net.Dialer{Timeout: 2 * time.Second}).DialContext, // connect tier
		TLSHandshakeTimeout:   2 * time.Second,
		ResponseHeaderTimeout: 4 * time.Second, // first-byte tier
		MaxIdleConnsPerHost:   64,
	},
}
Cost cap with minute-level hard gate
// Requires sync and time. sync/atomic has no Float64, and a separate
// load-then-add would race anyway, so use a mutex plus a minute window.
type BudgetGate struct {
	mu          sync.Mutex
	LimitPerMin float64
	used        float64
	windowStart time.Time
}

func (g *BudgetGate) Allow(cost float64) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if time.Since(g.windowStart) >= time.Minute {
		g.used, g.windowStart = 0, time.Now()
	}
	if g.used+cost > g.LimitPerMin {
		return false
	}
	g.used += cost
	return true
}
When over budget.
- High-value requests use shorter context and lower temperature
- Low-priority requests move to async queue
- Non-critical traffic returns cached results
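A sketch of how those three branches might look at the call site; handle, the cache map, and the critical flag are illustrative assumptions, not a prescribed API:

```go
package main

import "fmt"

// cache is a hypothetical stand-in for a real response cache.
var cache = map[string]string{"summary:doc1": "cached summary"}

// handle shows the over-budget branches: critical work is still
// admitted with a reduced request, everything else falls back to
// cache or an async queue.
func handle(allowed, critical bool, key string) string {
	if allowed {
		return "call provider"
	}
	if critical {
		return "call provider with shorter context"
	}
	if v, ok := cache[key]; ok {
		return v
	}
	return "enqueued for async processing"
}

func main() {
	fmt.Println(handle(false, false, "summary:doc1"))
}
```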
Fallback behavior, make degradation explicit
{
  "status": "degraded",
  "provider": "claude",
  "reason": "openai_timeout_budget_exceeded",
  "trace_id": "rt_20260408_xxx"
}
This keeps incident response fast and debuggable.
Metrics and alerts you actually need
- provider_success_rate
- provider_p95_latency
- provider_timeout_ratio
- token_cost_per_min
- fallback_trigger_count
- degrade_response_ratio
- alert: LLMProviderTimeoutSpike
  expr: provider_timeout_ratio{provider="openai"} > 0.08
  for: 5m
- alert: LLMBudgetNearLimit
  expr: token_cost_per_min > (budget_limit_per_min * 0.9)
  for: 3m
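As a dependency-free sketch, the cost and fallback counters can be exported with the stdlib expvar package; a production setup would more likely use a metrics library such as the Prometheus client, and the variable names below simply mirror the metric list above:

```go
package main

import (
	"expvar"
	"fmt"
)

// expvar publishes these on /debug/vars when an HTTP server runs;
// here it just gives us process-wide, race-safe counters.
var (
	tokenCostPerMin  = expvar.NewFloat("token_cost_per_min")
	fallbackTriggers = expvar.NewInt("fallback_trigger_count")
)

func main() {
	tokenCostPerMin.Add(0.0042) // record spend after each provider call
	fallbackTriggers.Add(1)     // record each fallback decision
	fmt.Println(tokenCostPerMin.Value(), fallbackTriggers.Value())
}
```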
Deployment checklist
- Health probes feed routing decisions
- Timeout tiers enabled (no one global timeout)
- Minute-level budget gate in production
- Degraded responses include reason + trace_id
- Dashboards and alerts for the 6 core metrics
Summary
Dual-provider routing is about controllability, not feature collecting.
Your minimum viable production loop is: health probes + timeout tiers + budget gate + explicit degradation. Build this first, then tune model quality.