If your Go service relies on a single LLM provider, two failure modes hurt the most: timeout spikes and billing spikes. A real production setup is not just “add another provider”; it is a single control plane for routing, timeout tiers, cost caps, and fallback.

This guide gives you a practical OpenAI + Claude dual-provider pattern with one priority: keep uptime first, then optimize quality.

Target SLOs and constraints

  • Success rate beats single-request latency
  • Per-minute cost must have a hard cap
  • Provider failures must trigger fast traffic shift

Routing model, classify then decide

Use a 3-layer flow.

  1. Admission layer (scene tags: chat, code, summary)
  2. Policy layer (SLO + budget based provider priority)
  3. Execution layer (timeout budget + retry policy)

type Provider string

const (
    OpenAI Provider = "openai"
    Claude Provider = "claude"
)

type RouteInput struct {
    Scene      string
    MaxLatency time.Duration
    MaxCostUSD float64
}

// Policy order: hard cost cap first, then provider health, then default.
func PickProvider(in RouteInput, health map[Provider]float64) Provider {
    if in.MaxCostUSD < 0.003 {
        return Claude // tight budget path
    }
    if health[OpenAI] >= 0.98 {
        return OpenAI
    }
    return Claude
}
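
The health map fed into PickProvider has to come from somewhere. A minimal sketch of a probe-backed tracker, using an exponentially weighted success rate so one bad minute decays instead of poisoning routing forever; the `HealthTracker` name and the EWMA choice are assumptions, not a fixed API:

```go
package main

import (
	"fmt"
	"sync"
)

type Provider string

const (
	OpenAI Provider = "openai"
	Claude Provider = "claude"
)

// HealthTracker keeps an exponentially weighted success rate per provider;
// Snapshot() is what you hand to PickProvider as its health map.
type HealthTracker struct {
	mu    sync.Mutex
	alpha float64              // smoothing factor, e.g. 0.2
	rate  map[Provider]float64 // EWMA of success (1.0) vs failure (0.0)
}

func NewHealthTracker(alpha float64) *HealthTracker {
	return &HealthTracker{
		alpha: alpha,
		rate:  map[Provider]float64{OpenAI: 1, Claude: 1}, // start optimistic
	}
}

// Record folds one request outcome into the provider's success rate.
func (h *HealthTracker) Record(p Provider, ok bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	v := 0.0
	if ok {
		v = 1.0
	}
	h.rate[p] = (1-h.alpha)*h.rate[p] + h.alpha*v
}

// Snapshot returns a copy that is safe to hand to the router.
func (h *HealthTracker) Snapshot() map[Provider]float64 {
	h.mu.Lock()
	defer h.mu.Unlock()
	out := make(map[Provider]float64, len(h.rate))
	for p, r := range h.rate {
		out[p] = r
	}
	return out
}

func main() {
	ht := NewHealthTracker(0.2)
	for i := 0; i < 20; i++ {
		ht.Record(OpenAI, false) // simulate an OpenAI outage
	}
	fmt.Printf("openai health: %.3f\n", ht.Snapshot()[OpenAI]) // well below 0.98
}
```

With alpha around 0.1–0.2, a sustained outage drags the rate under the 0.98 threshold within a couple dozen requests, which is the "fast traffic shift" the SLOs above ask for.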

Timeout tiers, never one global timeout

Split timeout into connect, first-byte, and total budget.

client := &http.Client{
    Timeout: 12 * time.Second, // total budget: headers + full body
    Transport: &http.Transport{
        DialContext:           (&net.Dialer{Timeout: 2 * time.Second}).DialContext, // connect
        TLSHandshakeTimeout:   2 * time.Second,
        ResponseHeaderTimeout: 4 * time.Second, // first byte (response headers)
        MaxIdleConnsPerHost:   64,
    },
}

Cost cap with minute-level hard gate

type BudgetGate struct {
    mu          sync.Mutex
    LimitPerMin float64
    usedPerMin  float64 // zeroed once a minute by a reset ticker
}

// Allow reserves cost against the current minute's budget, or rejects.
// The check and the add happen under one lock, so concurrent callers
// cannot jointly overshoot the cap. (The standard library's sync/atomic
// has no Float64, so a separate load-then-add would race anyway.)
func (g *BudgetGate) Allow(cost float64) bool {
    g.mu.Lock()
    defer g.mu.Unlock()
    if g.usedPerMin+cost > g.LimitPerMin {
        return false
    }
    g.usedPerMin += cost
    return true
}
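
The gate is only "minute-level" if something resets the counter. A self-contained sketch with the reset loop wired in; `ResetLoop` is an assumed name, and a production version might align resets to wall-clock minute boundaries instead:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// BudgetGate caps spend per minute; check and reserve happen under one lock.
type BudgetGate struct {
	mu          sync.Mutex
	LimitPerMin float64
	usedPerMin  float64
}

func (g *BudgetGate) Allow(cost float64) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.usedPerMin+cost > g.LimitPerMin {
		return false
	}
	g.usedPerMin += cost
	return true
}

// ResetLoop zeroes the counter on every tick; start it once per process.
func (g *BudgetGate) ResetLoop(ctx context.Context, every time.Duration) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			g.mu.Lock()
			g.usedPerMin = 0
			g.mu.Unlock()
		}
	}
}

func main() {
	g := &BudgetGate{LimitPerMin: 1.00}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go g.ResetLoop(ctx, time.Minute)

	fmt.Println(g.Allow(0.60), g.Allow(0.60), g.Allow(0.30)) // true false true
}
```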

When over budget.

  • High-value requests use shorter context and lower temperature
  • Low-priority requests move to async queue
  • Non-critical traffic returns cached results

Fallback behavior, make degradation explicit

{
  "status": "degraded",
  "provider": "claude",
  "reason": "openai_timeout_budget_exceeded",
  "trace_id": "rt_20260408_xxx"
}

This keeps incident response fast and debuggable.

Metrics and alerts you actually need

  • provider_success_rate
  • provider_p95_latency
  • provider_timeout_ratio
  • token_cost_per_min
  • fallback_trigger_count
  • degrade_response_ratio

- alert: LLMProviderTimeoutSpike
  expr: provider_timeout_ratio{provider="openai"} > 0.08
  for: 5m
- alert: LLMBudgetNearLimit
  expr: token_cost_per_min > (budget_limit_per_min * 0.9)
  for: 3m
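
A minimal way to expose counters like these from Go, using only the standard library's expvar; a Prometheus client would be the usual production choice, and the metric names simply follow the list above:

```go
package main

import (
	"expvar"
	"fmt"
)

// Registered once at package init; expvar serves them on /debug/vars
// when an HTTP server is running.
var (
	fallbackTriggers = expvar.NewInt("fallback_trigger_count")
	tokenCostPerMin  = expvar.NewFloat("token_cost_per_min")
)

func main() {
	fallbackTriggers.Add(1)  // bump on every provider fallback
	tokenCostPerMin.Add(0.42) // add each request's estimated cost
	fmt.Println(fallbackTriggers.Value(), tokenCostPerMin.Value())
}
```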

Deployment checklist

  1. Health probes feed routing decisions
  2. Timeout tiers enabled (no one global timeout)
  3. Minute-level budget gate in production
  4. Degraded responses include reason + trace_id
  5. Dashboards and alerts for the 6 core metrics

Summary

Dual-provider routing is about controllability, not collecting features.

Your minimum viable production loop is: health probes + timeout tiers + budget gate + explicit degradation. Build this first, then tune model quality.