Most multi-tenant AI platforms fail for two boring reasons: one tenant saturates shared capacity, and finance discovers the burn too late.
This guide gives you a practical Go blueprint: token-bucket throttling, budget circuit breakers, and request-level cost attribution.
## Three hard goals (conservative but production-safe)
- Isolation: one noisy tenant must not degrade others.
- Control: over-budget traffic is degraded or blocked automatically.
- Attribution: every request is traceable to tenant, model, and cost.
If you do not have all three, do not optimize routing yet. First, stop the bleeding.
## Architecture: three gates in order

- Tenant token bucket (QPS/TPM)
- Budget gate (hour/day/month windows)
- Provider call + usage writeback

```
Client -> API Gateway -> Tenant Middleware
          -> Token Bucket (tenant)
          -> Budget Guard (tenant/model/time-window)
          -> OpenAI Responses API
          -> Usage Collector -> Cost Ledger
```
## 1) Token buckets: tenant-level + model-level

A single global limiter is not enough: one tenant can drain it for everyone. Split it into two layers:

- Tenant bucket: caps total tenant throughput
- Model bucket: caps expensive model traffic separately
```go
package quota

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// BucketSet holds one tenant-wide limiter plus per-model limiters so
// expensive models are capped independently of total tenant traffic.
type BucketSet struct {
	TenantLimiter *rate.Limiter
	ModelLimiter  map[string]*rate.Limiter
}

func (b *BucketSet) Allow(ctx context.Context, model string) error {
	if !b.TenantLimiter.Allow() {
		return fmt.Errorf("tenant_rate_limited")
	}
	if lim, ok := b.ModelLimiter[model]; ok && !lim.Allow() {
		return fmt.Errorf("model_rate_limited")
	}
	return nil
}

func NewDefaultBucketSet() *BucketSet {
	return &BucketSet{
		// 20 rps steady rate, burst 20.
		TenantLimiter: rate.NewLimiter(rate.Every(50*time.Millisecond), 20),
		ModelLimiter: map[string]*rate.Limiter{
			// Expensive model: 8 rps, burst 8 (matches the starter profile below).
			"gpt-5.3": rate.NewLimiter(rate.Every(125*time.Millisecond), 8),
			// Cheaper model: 25 rps, burst 30.
			"gpt-5.3-mini": rate.NewLimiter(rate.Every(40*time.Millisecond), 30),
		},
	}
}
```
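The handler sketch later in this guide reads its buckets from a `limiterStore` that must hand back the *same* `BucketSet` per tenant across requests, otherwise the buckets reset on every call. A minimal generic store, a sketch using only the standard library (the names and shape are illustrative, not from any specific framework):

```go
package main

import "sync"

// TenantStore lazily creates one value per tenant and reuses it across
// requests, which is what makes the buckets stateful. Generic so the
// same shape can back both limiterStore and budgetStore.
type TenantStore[T any] struct {
	mu    sync.Mutex
	items map[string]T
	newT  func(tenantID string) T
}

func NewTenantStore[T any](newT func(string) T) *TenantStore[T] {
	return &TenantStore[T]{items: map[string]T{}, newT: newT}
}

// Get returns the tenant's existing value, creating it on first use.
func (s *TenantStore[T]) Get(tenantID string) T {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.items[tenantID]
	if !ok {
		v = s.newT(tenantID)
		s.items[tenantID] = v
	}
	return v
}
```

Wire it up with `NewTenantStore(func(string) *BucketSet { return NewDefaultBucketSet() })` and every request for the same tenant shares one bucket set.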
## 2) Budget circuit breaker: close earlier, not later

Use three levels:

- `hourly_soft_limit`: degrade
- `daily_hard_limit`: block non-critical traffic
- `monthly_hard_limit`: strict block except allowlist
```go
package quota

import "errors"

// BudgetSnapshot is a point-in-time read of a tenant's spend against
// its three budget windows.
type BudgetSnapshot struct {
	TenantID       string
	HourlyUsedUSD  float64
	DailyUsedUSD   float64
	MonthlyUsedUSD float64
	HourlySoftUSD  float64
	DailyHardUSD   float64
	MonthlyHardUSD float64
}

type Decision struct {
	Action string // allow | degrade | block
	Reason string
}

// EvaluateBudget checks the widest window first, so a blown monthly
// budget always wins over a merely degraded hour.
func EvaluateBudget(b BudgetSnapshot) Decision {
	if b.MonthlyUsedUSD >= b.MonthlyHardUSD {
		return Decision{Action: "block", Reason: "monthly_budget_exceeded"}
	}
	if b.DailyUsedUSD >= b.DailyHardUSD {
		return Decision{Action: "block", Reason: "daily_budget_exceeded"}
	}
	if b.HourlyUsedUSD >= b.HourlySoftUSD {
		return Decision{Action: "degrade", Reason: "hourly_soft_limit_reached"}
	}
	return Decision{Action: "allow", Reason: "within_budget"}
}

func MustAllow(d Decision) error {
	if d.Action == "block" {
		// errors.New, not fmt.Errorf: the reason is not a format string.
		return errors.New(d.Reason)
	}
	return nil
}
```
Keep degradation deterministic:
- Downgrade model tier
- Shrink context window
- Disable non-critical tool calls
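That ladder can be sketched as one pure function, so identical inputs always degrade identically. The `Req` fields, the token cap, and the `retrieval` allowlist below are hypothetical placeholders, not part of any SDK:

```go
package main

// Req is an illustrative request shape; adapt the fields to your own
// gateway's request type.
type Req struct {
	Model         string
	MaxOutputToks int
	Tools         []string
	Route         string
}

// DegradeRequest applies the deterministic ladder from the text:
// model downshift -> shorter context -> fewer tools.
func DegradeRequest(r Req) Req {
	// 1) Downgrade the model tier (names are the article's examples).
	if r.Model == "gpt-5.3" {
		r.Model = "gpt-5.3-mini"
	}
	// 2) Shrink the output/context budget to a fixed ceiling.
	if r.MaxOutputToks > 1024 {
		r.MaxOutputToks = 1024
	}
	// 3) Keep only tools on a critical allowlist (placeholder list).
	critical := map[string]bool{"retrieval": true}
	kept := make([]string, 0, len(r.Tools))
	for _, t := range r.Tools {
		if critical[t] {
			kept = append(kept, t)
		}
	}
	r.Tools = kept
	r.Route = "degraded_hourly_budget"
	return r
}
```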
## 3) Cost attribution: request-level ledger, not month-end guesswork

Log this per request:

- `request_id`, `tenant_id`, `user_id`
- `model`, `input_tokens`, `output_tokens`
- `latency_ms`, `status`, `cost_usd`
- `route` (normal vs degraded)
```sql
CREATE TABLE ai_cost_ledger (
    id            BIGSERIAL PRIMARY KEY,
    request_id    TEXT NOT NULL,
    tenant_id     TEXT NOT NULL,
    user_id       TEXT,
    model         TEXT NOT NULL,
    input_tokens  INT NOT NULL,
    output_tokens INT NOT NULL,
    latency_ms    INT NOT NULL,
    cost_usd      NUMERIC(12,6) NOT NULL,
    route         TEXT NOT NULL,
    status        TEXT NOT NULL,
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_ai_cost_ledger_tenant_time
    ON ai_cost_ledger (tenant_id, created_at DESC);
```
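The `pricing.Calc` call in the next section can be as simple as a per-model price table. A sketch with placeholder prices (these are illustrative numbers, not real OpenAI rates; load the real ones from config so a price change does not need a deploy):

```go
package main

// Price holds per-million-token rates for one model.
type Price struct {
	InputPerMTokUSD  float64
	OutputPerMTokUSD float64
}

// priceTable uses the article's example model names with made-up rates.
var priceTable = map[string]Price{
	"gpt-5.3":      {InputPerMTokUSD: 10.0, OutputPerMTokUSD: 30.0},
	"gpt-5.3-mini": {InputPerMTokUSD: 1.0, OutputPerMTokUSD: 4.0},
}

// Calc converts token usage into USD. Unknown models price to zero,
// which the attribution-missing dashboard metric should catch.
func Calc(model string, inputToks, outputToks int) float64 {
	p, ok := priceTable[model]
	if !ok {
		return 0
	}
	return float64(inputToks)/1e6*p.InputPerMTokUSD +
		float64(outputToks)/1e6*p.OutputPerMTokUSD
}
```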
## 4) Middleware order for OpenAI Responses in Go
```go
func HandleResponsesRequest(ctx context.Context, req Req) (Resp, error) {
	// Gate 1: token buckets (tenant first, then model).
	bucket := limiterStore.Get(req.TenantID)
	if err := bucket.Allow(ctx, req.Model); err != nil {
		return Resp{}, wrap429(err)
	}

	// Gate 2: budget breaker. Degrade before the hard block check.
	snap := budgetStore.Snapshot(req.TenantID)
	decision := EvaluateBudget(snap)
	if decision.Action == "degrade" {
		req.Model = "gpt-5.3-mini"
		req.Route = "degraded_hourly_budget"
	}
	if err := MustAllow(decision); err != nil {
		return Resp{}, wrap402(err)
	}

	// Gate 3: provider call + usage writeback. Write the ledger row
	// even when the provider errors, so failed calls stay attributable.
	resp, usage, err := openaiClient.Responses(req)
	cost := pricing.Calc(req.Model, usage.InputTokens, usage.OutputTokens)
	ledger.Write(req, usage, cost, decision)
	budgetStore.Accumulate(req.TenantID, cost)
	return resp, err
}
```
## Common failure modes
1) Request-only limiter, no token limiter
Long prompts bypass cost assumptions and blow budgets. Add TPM limits for expensive models.
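A minimal token-per-minute limiter can be a coarse fixed window over counted tokens. This is a standard-library sketch of the idea; a smoother option is `golang.org/x/time/rate` with `AllowN`, charging one permit per token:

```go
package main

import (
	"sync"
	"time"
)

// TPMLimiter caps tokens per minute. A request-only limiter charges a
// 100k-token prompt the same as a 100-token one; counting tokens
// closes that gap. The fixed window is deliberately simple.
type TPMLimiter struct {
	mu          sync.Mutex
	limit       int
	used        int
	windowStart time.Time
}

func NewTPMLimiter(tokensPerMinute int) *TPMLimiter {
	return &TPMLimiter{limit: tokensPerMinute, windowStart: time.Now()}
}

// AllowTokens reports whether n more tokens fit in the current
// one-minute window, resetting the window when it expires.
func (l *TPMLimiter) AllowTokens(n int) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if time.Since(l.windowStart) >= time.Minute {
		l.used = 0
		l.windowStart = time.Now()
	}
	if l.used+n > l.limit {
		return false
	}
	l.used += n
	return true
}
```

Estimate `n` from the prompt before the call and reconcile with the provider's reported usage afterward.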
2) Budget checks and cost writeback are inconsistent
Under load, requests keep passing after the threshold because the budget check and the cost writeback race each other. Put both behind one transactional boundary.
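The invariant can be shown with an in-memory store that holds the check and the update under one lock. In production the same boundary would be a DB transaction (`UPDATE ... WHERE used + $1 <= hard_limit`) or an atomic Redis script; this sketch only illustrates the shape, and all names are hypothetical:

```go
package main

import "sync"

// BudgetStore keeps the budget check and the spend update behind one
// mutex, so concurrent requests cannot all pass the check before any
// of them records its cost.
type BudgetStore struct {
	mu      sync.Mutex
	usedUSD map[string]float64
	hardUSD map[string]float64
}

func NewBudgetStore() *BudgetStore {
	return &BudgetStore{usedUSD: map[string]float64{}, hardUSD: map[string]float64{}}
}

func (s *BudgetStore) SetHardLimit(tenant string, usd float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.hardUSD[tenant] = usd
}

// TryReserve atomically checks and reserves an estimated cost; the
// caller settles the difference once real usage comes back.
func (s *BudgetStore) TryReserve(tenant string, estUSD float64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.usedUSD[tenant]+estUSD > s.hardUSD[tenant] {
		return false
	}
	s.usedUSD[tenant] += estUSD
	return true
}
```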
3) Everything returns 500
Downstream teams cannot react correctly. Return explicit classes:
- 429: `tenant_rate_limited`
- 402/429: `daily_budget_exceeded`
- 503: `provider_unavailable`
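One way to keep that mapping explicit is a single switch from gate reason to HTTP status, so no path falls through to a generic 500. A sketch, assuming the reason strings returned by the gates above:

```go
package main

import "net/http"

// statusFor maps gate reasons to explicit HTTP classes so callers can
// distinguish throttling from budget exhaustion from provider outages.
func statusFor(reason string) int {
	switch reason {
	case "tenant_rate_limited", "model_rate_limited":
		return http.StatusTooManyRequests // 429: retry with backoff
	case "daily_budget_exceeded", "monthly_budget_exceeded":
		return http.StatusPaymentRequired // 402: retrying will not help
	case "provider_unavailable":
		return http.StatusServiceUnavailable // 503: upstream outage
	default:
		return http.StatusInternalServerError
	}
}
```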
## Minimum dashboard
- Tenant QPS/TPM (P95)
- Throttle hit ratio by tenant
- Hour/day/month budget consumption
- Degradation trigger ratio
- Request cost distribution (P50/P95)
- Attribution missing ratio (target ~0)
## Safe default parameters (starter profile)

- Tenant limiter: 20 rps, burst 20
- Expensive model limiter: 8 rps, burst 8
- Hourly soft budget: 0.8% of monthly budget
- Daily hard budget: 8% of monthly budget
- Monthly hard budget: 100% of contract budget
- Degrade order: model downshift -> shorter context -> fewer tools
Run this profile for two weeks, then tune with real usage data.
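The budget thresholds above all derive from one contracted monthly figure, which keeps them consistent as contracts change. A sketch with the starter percentages baked in (names are illustrative):

```go
package main

// StarterBudgets holds the three thresholds fed to EvaluateBudget.
type StarterBudgets struct {
	HourlySoftUSD  float64
	DailyHardUSD   float64
	MonthlyHardUSD float64
}

// StarterBudget derives all thresholds from the contracted monthly
// budget: 0.8% hourly soft, 8% daily hard, 100% monthly hard.
func StarterBudget(monthlyUSD float64) StarterBudgets {
	return StarterBudgets{
		HourlySoftUSD:  monthlyUSD * 0.008,
		DailyHardUSD:   monthlyUSD * 0.08,
		MonthlyHardUSD: monthlyUSD,
	}
}
```

With a $10,000/month contract this yields roughly $80 hourly soft, $800 daily hard, and $10,000 monthly hard.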
## Wrap-up
Reliable multi-tenant AI governance is not magic. It is discipline:
- isolate with token buckets,
- contain with budget breakers,
- explain with request-level ledgers.
Once these are stable, scale is finally a product problem, not a fire drill.