Long-running agent sessions usually fail the same way: context keeps growing, latency spikes, costs blow up, and answer quality gets worse.
That is rarely a model-quality issue. It is almost always missing context governance.
Typical symptoms
- p95 latency increases with every conversation turn
- token spend grows non-linearly per session
- answer consistency drops on repeated intents
- more 429s and timeouts at peak traffic
A practical 3-layer strategy
1) Hard truncation: budgeted window + allowlist
Use a strict max_context_tokens budget (for example 24k).
- Always preserve system constraints and critical user facts
- Keep regular history with a recency-first sliding window
- Never let request builders exceed a hard ceiling
// Msg is a single conversation message with a precomputed token count.
type Msg struct {
	Role      string
	Content   string
	Important bool // pinned: system constraints, critical user facts
	Tokens    int
}

// TrimByBudget keeps every Important message, then fills the remaining
// budget with the most recent regular messages, and returns the result
// in the original chronological order.
func TrimByBudget(msgs []Msg, budget int) []Msg {
	used := 0
	kept := make([]bool, len(msgs))
	// Pass 1: always preserve pinned messages.
	for i, m := range msgs {
		if m.Important {
			kept[i] = true
			used += m.Tokens
		}
	}
	// Pass 2: recency-first sliding window over regular history.
	for i := len(msgs) - 1; i >= 0; i-- {
		if kept[i] || used+msgs[i].Tokens > budget {
			continue
		}
		kept[i] = true
		used += msgs[i].Tokens
	}
	// Rebuild in chronological order so the model sees a coherent history.
	out := make([]Msg, 0, len(msgs))
	for i, m := range msgs {
		if kept[i] {
			out = append(out, m)
		}
	}
	return out
}
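A quick usage sketch of the trimmer (Msg and TrimByBudget are repeated so the snippet compiles on its own; the roles and token counts are made-up illustrations):

```go
package main

import "fmt"

type Msg struct {
	Role      string
	Content   string
	Important bool
	Tokens    int
}

func TrimByBudget(msgs []Msg, budget int) []Msg {
	used := 0
	kept := make([]bool, len(msgs))
	for i, m := range msgs {
		if m.Important {
			kept[i] = true
			used += m.Tokens
		}
	}
	for i := len(msgs) - 1; i >= 0; i-- {
		if kept[i] || used+msgs[i].Tokens > budget {
			continue
		}
		kept[i] = true
		used += msgs[i].Tokens
	}
	out := make([]Msg, 0, len(msgs))
	for i, m := range msgs {
		if kept[i] {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	history := []Msg{
		{Role: "system", Content: "constraints", Important: true, Tokens: 100},
		{Role: "user", Content: "old question", Tokens: 500},
		{Role: "assistant", Content: "old answer", Tokens: 500},
		{Role: "user", Content: "latest question", Tokens: 200},
	}
	// A budget of 800 keeps the pinned system message (100 tokens), then
	// fills backwards: latest user turn (200) and the old answer (500).
	// The 500-token old question no longer fits and is dropped.
	for _, m := range TrimByBudget(history, 800) {
		fmt.Println(m.Role, m.Tokens)
	}
}
```

Note that the pinned system message stays at the front of the output even though the window fills from the back.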
2) Summary backfill: compress old history into structured memory
Avoid free-form prose summaries. Use a stable schema:
- Facts
- Open Issues
- Preferences
- Decisions + Why
This lets you do incremental updates with less drift.
3) Cost caps: per-session budget + graceful degradation
Set a budget (for example $0.5/session):
- at 70%: lower max_output_tokens
- at 90%: disable expensive tools
- at 100%: keep only summary + latest turn
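The threshold ladder above can be sketched as a small state function; the type and mode names are illustrative:

```go
package main

import "fmt"

// SessionBudget tracks spend against a hard per-session cap.
type SessionBudget struct {
	CapUSD   float64
	SpentUSD float64
}

// Mode maps the current spend ratio to a degradation tier, following
// the 70% / 90% / 100% ladder described above.
func (b SessionBudget) Mode() string {
	switch ratio := b.SpentUSD / b.CapUSD; {
	case ratio >= 1.0:
		return "summary+latest-turn-only"
	case ratio >= 0.9:
		return "disable-expensive-tools"
	case ratio >= 0.7:
		return "reduced-max-output-tokens"
	default:
		return "normal"
	}
}

func main() {
	b := SessionBudget{CapUSD: 0.5, SpentUSD: 0.36} // 72% of the cap
	fmt.Println(b.Mode())
}
```

Checking the ratio on every request (rather than once per session) is what makes the degradation graceful instead of a hard cutoff.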
Go implementation notes for the Responses API
- isolate a dedicated “history builder” module
- emit metrics per request: tokens, latency, estimated cost
- version your summary schema (summary_schema=v2)
- compress tool outputs before reinjection
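The per-request metrics point can be sketched as a small record plus a price-aware cost estimate; the price constants are placeholders, not real model rates:

```go
package main

import "fmt"

// Placeholder prices in USD per 1K tokens; substitute your model's real rates.
const (
	inPricePer1K  = 0.0005
	outPricePer1K = 0.0015
)

// RequestMetrics is one record emitted per model request.
type RequestMetrics struct {
	PromptTokens     int
	CompletionTokens int
	LatencyMS        int64
	EstCostUSD       float64
}

// EstimateCost converts token usage into a dollar estimate, so budget
// gates can act on price rather than on token counts alone.
func EstimateCost(prompt, completion int) float64 {
	return float64(prompt)/1000*inPricePer1K + float64(completion)/1000*outPricePer1K
}

func main() {
	m := RequestMetrics{PromptTokens: 12000, CompletionTokens: 800, LatencyMS: 2300}
	m.EstCostUSD = EstimateCost(m.PromptTokens, m.CompletionTokens)
	fmt.Printf("tokens=%d/%d latency=%dms cost=$%.4f\n",
		m.PromptTokens, m.CompletionTokens, m.LatencyMS, m.EstCostUSD)
}
```

Emitting this per request is what surfaces the token-only-budget failure mode in the checklist below: token counts can look flat while cost climbs, because input and output tokens are priced differently.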
Troubleshooting checklist
- duplicate history append in multiple layers
- raw tool output (logs/HTML) injected unbounded
- summary drift (assumptions recorded as facts)
- token-only budget without price-aware control
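For the unbounded-tool-output item, a hard character cap applied before reinjection is enough as a first line of defense; this sketch caps by bytes, so a production version should also respect UTF-8 rune boundaries:

```go
package main

import "fmt"

// CapToolOutput truncates raw tool output (logs, HTML, stack traces) to a
// hard character cap before reinjection, marking the cut explicitly so the
// model knows the output is incomplete. Byte-based; see rune caveat above.
func CapToolOutput(s string, maxChars int) string {
	if len(s) <= maxChars {
		return s
	}
	return s[:maxChars] + "\n[truncated]"
}

func main() {
	long := make([]byte, 10000)
	for i := range long {
		long[i] = 'x'
	}
	capped := CapToolOutput(string(long), 4000)
	fmt.Println(len(capped)) // 4000 bytes + the "\n[truncated]" marker
}
```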
Recommended baseline settings
max_context_tokens: 24k
summary_trigger_tokens: 16k
max_output_tokens: 1k
session_cost_cap_usd: 0.5
tool_output_hard_cap_chars: 4000
Wrap-up
Context explosion is an engineering control problem.
If you implement truncation, structured summaries, and strict budget gates, latency and cost become predictable, and model-quality work finally becomes meaningful.