Most multi-tenant AI platforms fail for two boring reasons: one tenant saturates shared capacity, and finance discovers the burn too late.

This guide gives you a practical Go blueprint: token-bucket throttling, budget circuit breakers, and request-level cost attribution.

Three hard goals (conservative but production-safe)

  1. Isolation: one noisy tenant must not degrade others.
  2. Control: over-budget traffic is degraded or blocked automatically.
  3. Attribution: every request is traceable to tenant, model, and cost.

If you do not have all three, do not optimize routing yet. First, stop the bleeding.

Architecture: three gates in order

  1. Tenant token bucket (QPS/TPM)
  2. Budget gate (hour/day/month windows)
  3. Provider call + usage writeback

Client -> API Gateway -> Tenant Middleware
                     -> Token Bucket (tenant)
                     -> Budget Guard (tenant/model/time-window)
                     -> OpenAI Responses API
                     -> Usage Collector -> Cost Ledger

1) Token buckets: tenant-level + model-level

A single global limiter is not enough: it cannot stop one tenant from starving the rest. Layer two buckets:

  • Tenant bucket: caps total tenant throughput
  • Model bucket: caps expensive model traffic separately

package quota

import (
    "context"
    "fmt"
    "time"

    "golang.org/x/time/rate"
)

// BucketSet holds one tenant-wide limiter plus optional
// per-model limiters for expensive tiers.
type BucketSet struct {
    TenantLimiter *rate.Limiter
    ModelLimiter  map[string]*rate.Limiter
}

// Allow checks the tenant bucket first, then the model bucket.
// ctx is unused today; it is kept so a blocking Wait variant can
// be added later without changing call sites.
func (b *BucketSet) Allow(ctx context.Context, model string) error {
    if !b.TenantLimiter.Allow() {
        return fmt.Errorf("tenant_rate_limited")
    }
    lim, ok := b.ModelLimiter[model]
    if ok && !lim.Allow() {
        return fmt.Errorf("model_rate_limited")
    }
    return nil
}

// NewDefaultBucketSet encodes the starter profile from the end of
// this guide: 20 rps/burst 20 per tenant, 8 rps/burst 8 for the
// expensive tier.
func NewDefaultBucketSet() *BucketSet {
    return &BucketSet{
        TenantLimiter: rate.NewLimiter(rate.Every(50*time.Millisecond), 20), // 20 rps, burst 20
        ModelLimiter: map[string]*rate.Limiter{
            "gpt-5.3":      rate.NewLimiter(rate.Every(125*time.Millisecond), 8), // 8 rps, burst 8
            "gpt-5.3-mini": rate.NewLimiter(rate.Every(40*time.Millisecond), 30), // 25 rps, burst 30
        },
    }
}
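
The handler in section 4 resolves buckets through a limiterStore. A minimal in-memory sketch of such a store, assuming lazy per-tenant creation (the shape is illustrative; production code would load per-tenant overrides from config and evict idle tenants):

package quota

import "sync"

// LimiterStore lazily creates one BucketSet per tenant, so a new
// tenant gets the defaults without a deploy.
type LimiterStore struct {
    mu      sync.Mutex
    buckets map[string]*BucketSet
}

func NewLimiterStore() *LimiterStore {
    return &LimiterStore{buckets: make(map[string]*BucketSet)}
}

// Get returns the tenant's buckets, creating defaults on first use.
func (s *LimiterStore) Get(tenantID string) *BucketSet {
    s.mu.Lock()
    defer s.mu.Unlock()
    b, ok := s.buckets[tenantID]
    if !ok {
        b = NewDefaultBucketSet()
        s.buckets[tenantID] = b
    }
    return b
}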

2) Budget circuit breaker: close earlier, not later

Use three levels:

  • hourly_soft_limit: degrade
  • daily_hard_limit: block non-critical traffic
  • monthly_hard_limit: strict block except allowlist

package quota

import "fmt"

type BudgetSnapshot struct {
    TenantID        string
    HourlyUsedUSD   float64
    DailyUsedUSD    float64
    MonthlyUsedUSD  float64
    HourlySoftUSD   float64
    DailyHardUSD    float64
    MonthlyHardUSD  float64
}

type Decision struct {
    Action string // allow | degrade | block
    Reason string
}

func EvaluateBudget(b BudgetSnapshot) Decision {
    if b.MonthlyUsedUSD >= b.MonthlyHardUSD {
        return Decision{Action: "block", Reason: "monthly_budget_exceeded"}
    }
    if b.DailyUsedUSD >= b.DailyHardUSD {
        return Decision{Action: "block", Reason: "daily_budget_exceeded"}
    }
    if b.HourlyUsedUSD >= b.HourlySoftUSD {
        return Decision{Action: "degrade", Reason: "hourly_soft_limit_reached"}
    }
    return Decision{Action: "allow", Reason: "within_budget"}
}

func MustAllow(d Decision) error {
    if d.Action == "block" {
        return fmt.Errorf(d.Reason)
    }
    return nil
}

Keep degradation deterministic:

  1. Downgrade model tier
  2. Shrink context window
  3. Disable non-critical tool calls
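
As a sketch of that order in one place, assuming illustrative request fields (MaxContextTokens and ToolsEnabled are placeholders, not fields defined elsewhere in this guide):

// ApplyDegrade applies the ladder in a fixed order so two
// identical requests always degrade identically.
func ApplyDegrade(req *Req) {
    req.Model = "gpt-5.3-mini" // 1. model downshift
    if req.MaxContextTokens > 8192 {
        req.MaxContextTokens = 8192 // 2. shrink context window
    }
    req.ToolsEnabled = false // 3. drop non-critical tool calls
    req.Route = "degraded_hourly_budget"
}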

3) Cost attribution: request-level ledger, not month-end guesswork

Log this per request:

  • request_id, tenant_id, user_id
  • model, input_tokens, output_tokens
  • latency_ms, status, cost_usd
  • route (normal vs degraded)

CREATE TABLE ai_cost_ledger (
  id BIGSERIAL PRIMARY KEY,
  request_id TEXT NOT NULL,
  tenant_id TEXT NOT NULL,
  model TEXT NOT NULL,
  input_tokens INT NOT NULL,
  output_tokens INT NOT NULL,
  cost_usd NUMERIC(12,6) NOT NULL,
  route TEXT NOT NULL,
  status TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_ai_cost_ledger_tenant_time ON ai_cost_ledger(tenant_id, created_at DESC);
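
With per-request rows, attribution becomes a query instead of month-end archaeology. For example, the top spenders over the last 24 hours:

SELECT tenant_id,
       SUM(cost_usd) AS spend_usd,
       COUNT(*) AS requests,
       SUM(input_tokens + output_tokens) AS total_tokens
FROM ai_cost_ledger
WHERE created_at >= now() - INTERVAL '24 hours'
GROUP BY tenant_id
ORDER BY spend_usd DESC
LIMIT 10;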

4) Middleware order for OpenAI Responses in Go

func HandleResponsesRequest(ctx context.Context, req Req) (Resp, error) {
    // Gate 1: token buckets (tenant first, then model).
    bucket := limiterStore.Get(req.TenantID)
    if err := bucket.Allow(ctx, req.Model); err != nil {
        return Resp{}, wrap429(err)
    }

    // Gate 2: budget breaker. Check the block case before
    // mutating the request for the degrade path.
    snap := budgetStore.Snapshot(req.TenantID)
    decision := EvaluateBudget(snap)
    if err := MustAllow(decision); err != nil {
        return Resp{}, wrap402(err)
    }
    if decision.Action == "degrade" {
        req.Model = "gpt-5.3-mini"
        req.Route = "degraded_hourly_budget"
    }

    // Gate 3: provider call plus usage writeback. The ledger row
    // is written on failures too, so every request stays
    // attributable; usage is zero-valued when the call errors.
    resp, usage, err := openaiClient.Responses(req)
    cost := pricing.Calc(req.Model, usage.InputTokens, usage.OutputTokens)
    ledger.Write(req, usage, cost, decision)
    budgetStore.Accumulate(req.TenantID, cost)

    return resp, err
}

Common failure modes

1) Request-only limiter, no token limiter

Long prompts bypass cost assumptions and blow budgets. Add TPM limits for expensive models.
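
A minimal sketch, reusing golang.org/x/time/rate but counting tokens instead of requests (the caps and token estimates are placeholders to tune per model):

package quota

import (
    "time"

    "golang.org/x/time/rate"
)

// NewTPMLimiter returns a bucket that counts tokens, not requests.
// Burst is one full minute, so an estimate above the cap always fails.
func NewTPMLimiter(tokensPerMinute int) *rate.Limiter {
    return rate.NewLimiter(rate.Limit(float64(tokensPerMinute)/60.0), tokensPerMinute)
}

// AllowTokens reserves the estimated prompt+output tokens before
// the provider call; a false return maps to the same 429 class
// as the request limiter.
func AllowTokens(lim *rate.Limiter, estimatedTokens int) bool {
    return lim.AllowN(time.Now(), estimatedTokens)
}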

2) Budget checks and cost writeback are inconsistent

When the budget read and the cost writeback race each other, traffic keeps passing after the threshold is crossed. Commit the ledger row and the budget counters inside one transactional boundary, as sketched below.
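
One way to get that boundary, assuming a hypothetical tenant_budget_usage counters table (only ai_cost_ledger is defined above):

BEGIN;
-- Record the request itself.
INSERT INTO ai_cost_ledger
  (request_id, tenant_id, model, input_tokens, output_tokens, cost_usd, route, status)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8);
-- Bump the counters the budget gate reads.
UPDATE tenant_budget_usage
   SET hourly_used_usd  = hourly_used_usd  + $6,
       daily_used_usd   = daily_used_usd   + $6,
       monthly_used_usd = monthly_used_usd + $6
 WHERE tenant_id = $2;
COMMIT;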

3) Everything returns 500

Downstream teams cannot react correctly. Return explicit classes:

  • 429: tenant_rate_limited
  • 402/429: daily_budget_exceeded
  • 503: provider_unavailable
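
A sketch of the wrap helpers used in section 4, assuming a simple typed error (APIError is illustrative, not an existing type):

package quota

// APIError carries an HTTP status plus a machine-readable reason
// so callers can branch on the class instead of parsing logs.
type APIError struct {
    Status int
    Reason string
}

func (e *APIError) Error() string { return e.Reason }

func wrap429(err error) error { return &APIError{Status: 429, Reason: err.Error()} }
func wrap402(err error) error { return &APIError{Status: 402, Reason: err.Error()} }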

Minimum dashboard

  1. Tenant QPS/TPM (P95)
  2. Throttle hit ratio by tenant
  3. Hour/day/month budget consumption
  4. Degradation trigger ratio
  5. Request cost distribution (P50/P95)
  6. Attribution missing ratio (target ~0)
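
Several of these fall straight out of the ledger. For example, the degradation trigger ratio per tenant over the last day (route marks degraded requests):

SELECT tenant_id,
       AVG((route LIKE 'degraded%')::int) AS degrade_ratio
FROM ai_cost_ledger
WHERE created_at >= now() - INTERVAL '1 day'
GROUP BY tenant_id
ORDER BY degrade_ratio DESC;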

Safe default parameters (starter profile)

  • Tenant limiter: 20 rps burst 20
  • Expensive model limiter: 8 rps burst 8
  • Hourly soft budget: 0.8% of monthly budget
  • Daily hard budget: 8% of monthly budget
  • Monthly hard budget: 100% contract budget
  • Degrade order: model downshift -> shorter context -> fewer tools
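
Those fractions map directly onto BudgetSnapshot's limit fields; a small helper sketch:

// StarterLimits derives the starter-profile thresholds from the
// monthly contract budget: 0.8% hourly soft, 8% daily hard,
// 100% monthly hard.
func StarterLimits(monthlyUSD float64) (hourlySoft, dailyHard, monthlyHard float64) {
    return 0.008 * monthlyUSD, 0.08 * monthlyUSD, monthlyUSD
}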

Run this profile for two weeks, then tune with real usage data.

Wrap-up

Reliable multi-tenant AI governance is not magic. It is discipline:

  • isolate with token buckets,
  • contain with budget breakers,
  • explain with request-level ledgers.

Once these are stable, scale is finally a product problem, not a fire drill.