Long-running agent sessions usually fail the same way: context keeps growing, latency spikes, costs blow up, and answer quality gets worse.
That is rarely a model-quality issue. It is almost always missing context governance.
Typical symptoms
- p95 latency increases with every conversation turn
- token spend grows non-linearly per session
- answer consistency drops on repeated intents
- more 429s and timeouts at peak traffic
A practical 3-layer strategy
1) Hard truncation: budgeted window + allowlist
Use a strict max_context_tokens budget (for example 24k).
- Always preserve system constraints and critical user facts
- Keep regular history with a recency-first sliding window
- Never let request builders exceed a hard ceiling
// Msg is a single conversation message with a precomputed token count.
type Msg struct {
	Role      string
	Content   string
	Important bool // pinned: system constraints, critical user facts
	Tokens    int
}

// TrimByBudget keeps every Important message, then fills the remaining
// budget with the most recent regular messages, and returns the result
// in the original chronological order.
func TrimByBudget(msgs []Msg, budget int) []Msg {
	used := 0
	kept := make([]bool, len(msgs))
	// Pass 1: always preserve pinned messages.
	for i, m := range msgs {
		if m.Important {
			kept[i] = true
			used += m.Tokens
		}
	}
	// Pass 2: recency-first sliding window over regular history.
	for i := len(msgs) - 1; i >= 0; i-- {
		if kept[i] || used+msgs[i].Tokens > budget {
			continue
		}
		kept[i] = true
		used += msgs[i].Tokens
	}
	// Rebuild in chronological order so the model sees a coherent history.
	out := make([]Msg, 0, len(msgs))
	for i, m := range msgs {
		if kept[i] {
			out = append(out, m)
		}
	}
	return out
}
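A quick usage sketch of the trimmer (Msg and TrimByBudget are repeated so the snippet compiles on its own; the roles and token counts are made-up illustrations):

```go
package main

import "fmt"

type Msg struct {
	Role      string
	Content   string
	Important bool
	Tokens    int
}

func TrimByBudget(msgs []Msg, budget int) []Msg {
	used := 0
	kept := make([]bool, len(msgs))
	for i, m := range msgs {
		if m.Important {
			kept[i] = true
			used += m.Tokens
		}
	}
	for i := len(msgs) - 1; i >= 0; i-- {
		if kept[i] || used+msgs[i].Tokens > budget {
			continue
		}
		kept[i] = true
		used += msgs[i].Tokens
	}
	out := make([]Msg, 0, len(msgs))
	for i, m := range msgs {
		if kept[i] {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	history := []Msg{
		{Role: "system", Content: "constraints", Important: true, Tokens: 100},
		{Role: "user", Content: "old question", Tokens: 500},
		{Role: "assistant", Content: "old answer", Tokens: 500},
		{Role: "user", Content: "latest question", Tokens: 200},
	}
	// A budget of 800 keeps the pinned system message (100 tokens), then
	// fills backwards: latest user turn (200) and the old answer (500).
	// The 500-token old question no longer fits and is dropped.
	for _, m := range TrimByBudget(history, 800) {
		fmt.Println(m.Role, m.Tokens)
	}
}
```

Note that the pinned system message stays at the front of the output even though the window fills from the back.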
2) Summary backfill: compress old history into structured memory
Avoid free-form prose summaries. Use a stable schema:
- Facts
- Open Issues
- Preferences
- Decisions + Why
This lets you do incremental updates with less drift.
3) Cost caps: per-session budget + graceful degradation
Set a budget (for example $0.5/session):
- at 70%: lower max_output_tokens
- at 90%: disable expensive tools
- at 100%: keep only summary + latest turn
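The threshold ladder above can be sketched as a small state function; the type and mode names are illustrative:

```go
package main

import "fmt"

// SessionBudget tracks spend against a hard per-session cap.
type SessionBudget struct {
	CapUSD   float64
	SpentUSD float64
}

// Mode maps the current spend ratio to a degradation tier, following
// the 70% / 90% / 100% ladder described above.
func (b SessionBudget) Mode() string {
	switch ratio := b.SpentUSD / b.CapUSD; {
	case ratio >= 1.0:
		return "summary+latest-turn-only"
	case ratio >= 0.9:
		return "disable-expensive-tools"
	case ratio >= 0.7:
		return "reduced-max-output-tokens"
	default:
		return "normal"
	}
}

func main() {
	b := SessionBudget{CapUSD: 0.5, SpentUSD: 0.36} // 72% of the cap
	fmt.Println(b.Mode())
}
```

Checking the ratio on every request (rather than once per session) is what makes the degradation graceful instead of a hard cutoff.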
Go implementation notes for the Responses API
- isolate a dedicated “history builder” module
- emit metrics per request: tokens, latency, estimated cost
- version your summary schema (summary_schema=v2)
- compress tool outputs before reinjection
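The per-request metrics point can be sketched as a small record plus a price-aware cost estimate; the price constants are placeholders, not real model rates:

```go
package main

import "fmt"

// Placeholder prices in USD per 1K tokens; substitute your model's real rates.
const (
	inPricePer1K  = 0.0005
	outPricePer1K = 0.0015
)

// RequestMetrics is one record emitted per model request.
type RequestMetrics struct {
	PromptTokens     int
	CompletionTokens int
	LatencyMS        int64
	EstCostUSD       float64
}

// EstimateCost converts token usage into a dollar estimate, so budget
// gates can act on price rather than on token counts alone.
func EstimateCost(prompt, completion int) float64 {
	return float64(prompt)/1000*inPricePer1K + float64(completion)/1000*outPricePer1K
}

func main() {
	m := RequestMetrics{PromptTokens: 12000, CompletionTokens: 800, LatencyMS: 2300}
	m.EstCostUSD = EstimateCost(m.PromptTokens, m.CompletionTokens)
	fmt.Printf("tokens=%d/%d latency=%dms cost=$%.4f\n",
		m.PromptTokens, m.CompletionTokens, m.LatencyMS, m.EstCostUSD)
}
```

Emitting this per request is what surfaces the token-only-budget failure mode in the checklist below: token counts can look flat while cost climbs, because input and output tokens are priced differently.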
Troubleshooting checklist
- duplicate history append in multiple layers
- raw tool output (logs/HTML) injected unbounded
- summary drift (assumptions recorded as facts)
- token-only budget without price-aware control
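For the unbounded-tool-output item, a hard character cap applied before reinjection is enough as a first line of defense; this sketch caps by bytes, so a production version should also respect UTF-8 rune boundaries:

```go
package main

import "fmt"

// CapToolOutput truncates raw tool output (logs, HTML, stack traces) to a
// hard character cap before reinjection, marking the cut explicitly so the
// model knows the output is incomplete. Byte-based; see rune caveat above.
func CapToolOutput(s string, maxChars int) string {
	if len(s) <= maxChars {
		return s
	}
	return s[:maxChars] + "\n[truncated]"
}

func main() {
	long := make([]byte, 10000)
	for i := range long {
		long[i] = 'x'
	}
	capped := CapToolOutput(string(long), 4000)
	fmt.Println(len(capped)) // 4000 bytes + the "\n[truncated]" marker
}
```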
Recommended baseline settings
max_context_tokens: 24k
summary_trigger_tokens: 16k
max_output_tokens: 1k
session_cost_cap_usd: 0.5
tool_output_hard_cap_chars: 4000
Wrap-up
Context explosion is an engineering control problem.
If you implement truncation, structured summaries, and strict budget gates, latency and cost become predictable, and model-quality work finally becomes meaningful.