If your Go service relies on a single LLM provider, two failure modes hurt the most: timeout spikes and billing spikes. A real production setup is not just "add another provider"; it is a single control plane for routing, timeout tiers, cost caps, and fallback.
This guide walks through a practical OpenAI + Claude dual-provider pattern with one priority: keep uptime first, then optimize quality.
Target SLOs and constraints
- Success rate beats single-request latency
- Per-minute cost must have a hard cap
- Provider failures must trigger fast traffic shift
Routing model, classify then decide
Use a 3-layer flow.
- Admission layer (scene tags: chat, code, summary)
- Policy layer (SLO + budget based provider priority)
- Execution layer (timeout budget + retry policy)
import "time"

type Provider string

const (
	OpenAI Provider = "openai"
	Claude Provider = "claude"
)

type RouteInput struct {
	Scene      string        // admission-layer tag: chat, code, summary
	MaxLatency time.Duration // per-request latency budget
	MaxCostUSD float64       // per-request cost ceiling
}

// PickProvider is the policy layer: route cheap requests to the
// lower-cost provider, otherwise follow provider health.
func PickProvider(in RouteInput, health map[Provider]float64) Provider {
	if in.MaxCostUSD < 0.003 {
		return Claude
	}
	if health[OpenAI] >= 0.98 {
		return OpenAI
	}
	return Claude
}
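The health map fed into PickProvider has to come from somewhere. Below is a minimal sketch of a rolling success-rate tracker; HealthTracker, its unbounded counters, and the 1.0 no-data default are illustrative assumptions, not part of the routing code above (a production version would use a sliding window):

```go
package main

import (
	"fmt"
	"sync"
)

// HealthTracker (hypothetical name) accumulates success/total counts
// per provider from probe or request outcomes.
type HealthTracker struct {
	mu      sync.Mutex
	ok, all map[string]int
}

func NewHealthTracker() *HealthTracker {
	return &HealthTracker{ok: map[string]int{}, all: map[string]int{}}
}

// Record notes one outcome for a provider.
func (h *HealthTracker) Record(p string, success bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.all[p]++
	if success {
		h.ok[p]++
	}
}

// Rate returns the observed success rate, defaulting to 1.0 when
// no data has been recorded yet.
func (h *HealthTracker) Rate(p string) float64 {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.all[p] == 0 {
		return 1.0
	}
	return float64(h.ok[p]) / float64(h.all[p])
}

func main() {
	h := NewHealthTracker()
	for i := 0; i < 100; i++ {
		h.Record("openai", i%10 != 0) // 90% success
	}
	fmt.Printf("openai health: %.2f\n", h.Rate("openai"))
}
```

Feed `h.Rate(...)` values into the `health` map before each routing decision, or on a short probe interval.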
Timeout tiers, never one global timeout
Split timeout into connect, first-byte, and total budget.
// Requires net, net/http, and time.
client := &http.Client{
	Timeout: 12 * time.Second, // total budget: connect + TLS + first byte + body
	Transport: &http.Transport{
		DialContext:           (&net.Dialer{Timeout: 2 * time.Second}).DialContext, // connect tier
		TLSHandshakeTimeout:   2 * time.Second,
		ResponseHeaderTimeout: 4 * time.Second, // first-byte tier
		MaxIdleConnsPerHost:   64,
	},
}
Cost cap with minute-level hard gate
// Requires sync and time. sync/atomic has no Float64, and a separate
// load-then-add would race anyway, so use a mutex plus a minute window.
type BudgetGate struct {
	mu          sync.Mutex
	LimitPerMin float64
	used        float64
	windowStart time.Time
}

func (g *BudgetGate) Allow(cost float64) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if time.Since(g.windowStart) >= time.Minute {
		g.used, g.windowStart = 0, time.Now()
	}
	if g.used+cost > g.LimitPerMin {
		return false
	}
	g.used += cost
	return true
}
When over budget.
- High-value requests use shorter context and lower temperature
- Low-priority requests move to async queue
- Non-critical traffic returns cached results
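A sketch of how those three branches might look at the call site; handle, the cache map, and the critical flag are illustrative assumptions, not a prescribed API:

```go
package main

import "fmt"

// cache is a hypothetical stand-in for a real response cache.
var cache = map[string]string{"summary:doc1": "cached summary"}

// handle shows the over-budget branches: critical work is still
// admitted with a reduced request, everything else falls back to
// cache or an async queue.
func handle(allowed, critical bool, key string) string {
	if allowed {
		return "call provider"
	}
	if critical {
		return "call provider with shorter context"
	}
	if v, ok := cache[key]; ok {
		return v
	}
	return "enqueued for async processing"
}

func main() {
	fmt.Println(handle(false, false, "summary:doc1"))
}
```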
Fallback behavior, make degradation explicit
{
  "status": "degraded",
  "provider": "claude",
  "reason": "openai_timeout_budget_exceeded",
  "trace_id": "rt_20260408_xxx"
}
This keeps incident response fast and debuggable.
Metrics and alerts you actually need
- provider_success_rate
- provider_p95_latency
- provider_timeout_ratio
- token_cost_per_min
- fallback_trigger_count
- degrade_response_ratio
- alert: LLMProviderTimeoutSpike
  expr: provider_timeout_ratio{provider="openai"} > 0.08
  for: 5m
- alert: LLMBudgetNearLimit
  expr: token_cost_per_min > (budget_limit_per_min * 0.9)
  for: 3m
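As a dependency-free sketch, the cost and fallback counters can be exported with the stdlib expvar package; a production setup would more likely use a metrics library such as the Prometheus client, and the variable names below simply mirror the metric list above:

```go
package main

import (
	"expvar"
	"fmt"
)

// expvar publishes these on /debug/vars when an HTTP server runs;
// here it just gives us process-wide, race-safe counters.
var (
	tokenCostPerMin  = expvar.NewFloat("token_cost_per_min")
	fallbackTriggers = expvar.NewInt("fallback_trigger_count")
)

func main() {
	tokenCostPerMin.Add(0.0042) // record spend after each provider call
	fallbackTriggers.Add(1)     // record each fallback decision
	fmt.Println(tokenCostPerMin.Value(), fallbackTriggers.Value())
}
```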
Deployment checklist
- Health probes feed routing decisions
- Timeout tiers enabled (no one global timeout)
- Minute-level budget gate in production
- Degraded responses include reason + trace_id
- Dashboards and alerts for the 6 core metrics
Summary
Dual-provider routing is about controllability, not feature collecting.
Your minimum viable production loop is: health probes + timeout tiers + budget gate + explicit degradation. Build this first, then tune model quality.