Most multi-tenant AI platforms fail for two boring reasons: one tenant saturates shared capacity, and finance discovers the burn too late.

This guide gives you a practical Go blueprint: token-bucket throttling, budget circuit breakers, and request-level cost attribution.

Three hard goals (conservative but production-safe)

  1. Isolation: one noisy tenant must not degrade others.
  2. Control: over-budget traffic is degraded or blocked automatically.
  3. Attribution: every request is traceable to tenant, model, and cost.

If you do not have all three, do not optimize routing yet. First, stop the bleeding.

Architecture: three gates in order

  1. Tenant token bucket (QPS/TPM)
  2. Budget gate (hour/day/month windows)
  3. Provider call + usage writeback

Client -> API Gateway -> Tenant Middleware
                     -> Token Bucket (tenant)
                     -> Budget Guard (tenant/model/time-window)
                     -> OpenAI Responses API
                     -> Usage Collector -> Cost Ledger

1) Token buckets: tenant-level + model-level

A single global limiter is not enough: it cannot stop one tenant from starving the rest. Layer two buckets:

  • Tenant bucket: caps total tenant throughput
  • Model bucket: caps expensive model traffic separately

package quota

import (
    "context"
    "fmt"
    "time"

    "golang.org/x/time/rate"
)

// BucketSet holds one tenant-wide limiter plus optional
// per-model limiters for expensive tiers.
type BucketSet struct {
    TenantLimiter *rate.Limiter
    ModelLimiter  map[string]*rate.Limiter
}

// Allow checks the tenant bucket first, then the model bucket.
// ctx is unused today; it is kept so a blocking Wait variant can
// be added later without changing call sites.
func (b *BucketSet) Allow(ctx context.Context, model string) error {
    if !b.TenantLimiter.Allow() {
        return fmt.Errorf("tenant_rate_limited")
    }
    lim, ok := b.ModelLimiter[model]
    if ok && !lim.Allow() {
        return fmt.Errorf("model_rate_limited")
    }
    return nil
}

// NewDefaultBucketSet encodes the starter profile from the end of
// this guide: 20 rps/burst 20 per tenant, 8 rps/burst 8 for the
// expensive tier.
func NewDefaultBucketSet() *BucketSet {
    return &BucketSet{
        TenantLimiter: rate.NewLimiter(rate.Every(50*time.Millisecond), 20), // 20 rps, burst 20
        ModelLimiter: map[string]*rate.Limiter{
            "gpt-5.3":      rate.NewLimiter(rate.Every(125*time.Millisecond), 8), // 8 rps, burst 8
            "gpt-5.3-mini": rate.NewLimiter(rate.Every(40*time.Millisecond), 30), // 25 rps, burst 30
        },
    }
}
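
The handler in section 4 resolves buckets through a limiterStore. A minimal in-memory sketch of such a store, assuming lazy per-tenant creation (the shape is illustrative; production code would load per-tenant overrides from config and evict idle tenants):

package quota

import "sync"

// LimiterStore lazily creates one BucketSet per tenant, so a new
// tenant gets the defaults without a deploy.
type LimiterStore struct {
    mu      sync.Mutex
    buckets map[string]*BucketSet
}

func NewLimiterStore() *LimiterStore {
    return &LimiterStore{buckets: make(map[string]*BucketSet)}
}

// Get returns the tenant's buckets, creating defaults on first use.
func (s *LimiterStore) Get(tenantID string) *BucketSet {
    s.mu.Lock()
    defer s.mu.Unlock()
    b, ok := s.buckets[tenantID]
    if !ok {
        b = NewDefaultBucketSet()
        s.buckets[tenantID] = b
    }
    return b
}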

2) Budget circuit breaker: close earlier, not later

Use three levels:

  • hourly_soft_limit: degrade
  • daily_hard_limit: block non-critical traffic
  • monthly_hard_limit: strict block except allowlist

package quota

import "fmt"

type BudgetSnapshot struct {
    TenantID        string
    HourlyUsedUSD   float64
    DailyUsedUSD    float64
    MonthlyUsedUSD  float64
    HourlySoftUSD   float64
    DailyHardUSD    float64
    MonthlyHardUSD  float64
}

type Decision struct {
    Action string // allow | degrade | block
    Reason string
}

func EvaluateBudget(b BudgetSnapshot) Decision {
    if b.MonthlyUsedUSD >= b.MonthlyHardUSD {
        return Decision{Action: "block", Reason: "monthly_budget_exceeded"}
    }
    if b.DailyUsedUSD >= b.DailyHardUSD {
        return Decision{Action: "block", Reason: "daily_budget_exceeded"}
    }
    if b.HourlyUsedUSD >= b.HourlySoftUSD {
        return Decision{Action: "degrade", Reason: "hourly_soft_limit_reached"}
    }
    return Decision{Action: "allow", Reason: "within_budget"}
}

func MustAllow(d Decision) error {
    if d.Action == "block" {
        return fmt.Errorf(d.Reason)
    }
    return nil
}

Keep degradation deterministic:

  1. Downgrade model tier
  2. Shrink context window
  3. Disable non-critical tool calls
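
As a sketch of that order in one place, assuming illustrative request fields (MaxContextTokens and ToolsEnabled are placeholders, not fields defined elsewhere in this guide):

// ApplyDegrade applies the ladder in a fixed order so two
// identical requests always degrade identically.
func ApplyDegrade(req *Req) {
    req.Model = "gpt-5.3-mini" // 1. model downshift
    if req.MaxContextTokens > 8192 {
        req.MaxContextTokens = 8192 // 2. shrink context window
    }
    req.ToolsEnabled = false // 3. drop non-critical tool calls
    req.Route = "degraded_hourly_budget"
}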

3) Cost attribution: request-level ledger, not month-end guesswork

Log this per request:

  • request_id, tenant_id, user_id
  • model, input_tokens, output_tokens
  • latency_ms, status, cost_usd
  • route (normal vs degraded)

CREATE TABLE ai_cost_ledger (
  id BIGSERIAL PRIMARY KEY,
  request_id TEXT NOT NULL,
  tenant_id TEXT NOT NULL,
  model TEXT NOT NULL,
  input_tokens INT NOT NULL,
  output_tokens INT NOT NULL,
  cost_usd NUMERIC(12,6) NOT NULL,
  route TEXT NOT NULL,
  status TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_ai_cost_ledger_tenant_time ON ai_cost_ledger(tenant_id, created_at DESC);
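
With per-request rows, attribution becomes a query instead of month-end archaeology. For example, the top spenders over the last 24 hours:

SELECT tenant_id,
       SUM(cost_usd) AS spend_usd,
       COUNT(*) AS requests,
       SUM(input_tokens + output_tokens) AS total_tokens
FROM ai_cost_ledger
WHERE created_at >= now() - INTERVAL '24 hours'
GROUP BY tenant_id
ORDER BY spend_usd DESC
LIMIT 10;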

4) Middleware order for OpenAI Responses in Go

func HandleResponsesRequest(ctx context.Context, req Req) (Resp, error) {
    // Gate 1: token buckets (tenant first, then model).
    bucket := limiterStore.Get(req.TenantID)
    if err := bucket.Allow(ctx, req.Model); err != nil {
        return Resp{}, wrap429(err)
    }

    // Gate 2: budget breaker. Check the block case before
    // mutating the request for the degrade path.
    snap := budgetStore.Snapshot(req.TenantID)
    decision := EvaluateBudget(snap)
    if err := MustAllow(decision); err != nil {
        return Resp{}, wrap402(err)
    }
    if decision.Action == "degrade" {
        req.Model = "gpt-5.3-mini"
        req.Route = "degraded_hourly_budget"
    }

    // Gate 3: provider call plus usage writeback. The ledger row
    // is written on failures too, so every request stays
    // attributable; usage is zero-valued when the call errors.
    resp, usage, err := openaiClient.Responses(req)
    cost := pricing.Calc(req.Model, usage.InputTokens, usage.OutputTokens)
    ledger.Write(req, usage, cost, decision)
    budgetStore.Accumulate(req.TenantID, cost)

    return resp, err
}

Common failure modes

1) Request-only limiter, no token limiter

Long prompts bypass cost assumptions and blow budgets. Add TPM limits for expensive models.
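
A minimal sketch, reusing golang.org/x/time/rate but counting tokens instead of requests (the caps and token estimates are placeholders to tune per model):

package quota

import (
    "time"

    "golang.org/x/time/rate"
)

// NewTPMLimiter returns a bucket that counts tokens, not requests.
// Burst is one full minute, so an estimate above the cap always fails.
func NewTPMLimiter(tokensPerMinute int) *rate.Limiter {
    return rate.NewLimiter(rate.Limit(float64(tokensPerMinute)/60.0), tokensPerMinute)
}

// AllowTokens reserves the estimated prompt+output tokens before
// the provider call; a false return maps to the same 429 class
// as the request limiter.
func AllowTokens(lim *rate.Limiter, estimatedTokens int) bool {
    return lim.AllowN(time.Now(), estimatedTokens)
}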

2) Budget checks and cost writeback are inconsistent

When the budget read and the cost writeback race each other, traffic keeps passing after the threshold is crossed. Commit the ledger row and the budget counters inside one transactional boundary, as sketched below.
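
One way to get that boundary, assuming a hypothetical tenant_budget_usage counters table (only ai_cost_ledger is defined above):

BEGIN;
-- Record the request itself.
INSERT INTO ai_cost_ledger
  (request_id, tenant_id, model, input_tokens, output_tokens, cost_usd, route, status)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8);
-- Bump the counters the budget gate reads.
UPDATE tenant_budget_usage
   SET hourly_used_usd  = hourly_used_usd  + $6,
       daily_used_usd   = daily_used_usd   + $6,
       monthly_used_usd = monthly_used_usd + $6
 WHERE tenant_id = $2;
COMMIT;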

3) Everything returns 500

Downstream teams cannot react correctly. Return explicit classes:

  • 429: tenant_rate_limited
  • 402/429: daily_budget_exceeded
  • 503: provider_unavailable
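
A sketch of the wrap helpers used in section 4, assuming a simple typed error (APIError is illustrative, not an existing type):

package quota

// APIError carries an HTTP status plus a machine-readable reason
// so callers can branch on the class instead of parsing logs.
type APIError struct {
    Status int
    Reason string
}

func (e *APIError) Error() string { return e.Reason }

func wrap429(err error) error { return &APIError{Status: 429, Reason: err.Error()} }
func wrap402(err error) error { return &APIError{Status: 402, Reason: err.Error()} }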

Minimum dashboard

  1. Tenant QPS/TPM (P95)
  2. Throttle hit ratio by tenant
  3. Hour/day/month budget consumption
  4. Degradation trigger ratio
  5. Request cost distribution (P50/P95)
  6. Attribution missing ratio (target ~0)
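
Several of these fall straight out of the ledger. For example, the degradation trigger ratio per tenant over the last day (route marks degraded requests):

SELECT tenant_id,
       AVG((route LIKE 'degraded%')::int) AS degrade_ratio
FROM ai_cost_ledger
WHERE created_at >= now() - INTERVAL '1 day'
GROUP BY tenant_id
ORDER BY degrade_ratio DESC;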

Safe default parameters (starter profile)

  • Tenant limiter: 20 rps burst 20
  • Expensive model limiter: 8 rps burst 8
  • Hourly soft budget: 0.8% of monthly budget
  • Daily hard budget: 8% of monthly budget
  • Monthly hard budget: 100% contract budget
  • Degrade order: model downshift -> shorter context -> fewer tools
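
Those fractions map directly onto BudgetSnapshot's limit fields; a small helper sketch:

// StarterLimits derives the starter-profile thresholds from the
// monthly contract budget: 0.8% hourly soft, 8% daily hard,
// 100% monthly hard.
func StarterLimits(monthlyUSD float64) (hourlySoft, dailyHard, monthlyHard float64) {
    return 0.008 * monthlyUSD, 0.08 * monthlyUSD, monthlyUSD
}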

Run this profile for two weeks, then tune with real usage data.

Wrap-up

Reliable multi-tenant AI governance is not magic. It is discipline:

  • isolate with token buckets,
  • contain with budget breakers,
  • explain with request-level ledgers.

Once these are stable, scale is finally a product problem, not a fire drill.