Long-running agent sessions usually fail the same way: context keeps growing, latency spikes, costs blow up, and answer quality gets worse.

That is rarely a model-quality issue; it almost always comes down to missing context governance.

Typical symptoms

  • p95 latency increases with every conversation turn
  • token spend grows non-linearly per session
  • answer consistency drops on repeated intents
  • more 429s and timeouts at peak traffic

A practical 3-layer strategy

1) Hard truncation: budgeted window + allowlist

Use a strict max_context_tokens budget (for example 24k).

  • Always preserve system constraints and critical user facts
  • Keep regular history with a recency-first sliding window
  • Never let request builders exceed a hard ceiling

// Msg is one conversation turn as seen by the history builder.
type Msg struct {
  Role      string
  Content   string
  Important bool // system constraints and critical user facts are always kept
  Tokens    int
}

// TrimByBudget keeps every Important message, then fills the remaining budget
// with the most recent regular messages, and returns everything in the
// original chronological order.
func TrimByBudget(msgs []Msg, budget int) []Msg {
  include := make([]bool, len(msgs))
  used := 0

  // Pass 1: important messages are preserved unconditionally.
  for i, m := range msgs {
    if m.Important {
      include[i] = true
      used += m.Tokens
    }
  }

  // Pass 2: recency-first sliding window over the regular history.
  for i := len(msgs) - 1; i >= 0; i-- {
    m := msgs[i]
    if m.Important || used+m.Tokens > budget {
      continue
    }
    include[i] = true
    used += m.Tokens
  }

  // Rebuild in order so the request body stays a coherent conversation.
  keep := make([]Msg, 0, len(msgs))
  for i, m := range msgs {
    if include[i] {
      keep = append(keep, m)
    }
  }
  return keep
}
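
A single history-builder entry point should be the only caller of TrimByBudget, so no other request path can bypass the ceiling.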

2) Summary backfill: compress old history into structured memory

Avoid free-form prose summaries. Use a stable schema:

  • Facts
  • Open Issues
  • Preferences
  • Decisions + Why

This lets you do incremental updates with less drift.
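
One way to keep the schema stable is a typed record that the summarizer rewrites field by field. The sketch below is illustrative; the type and field names (SessionSummary, Decision, and so on) are assumptions, not a fixed format.

// Illustrative structured-summary record; names are assumptions, not a spec.
type SessionSummary struct {
  SchemaVersion string     `json:"summary_schema"` // e.g. "v2", so old summaries can be migrated
  Facts         []string   `json:"facts"`          // verified statements only, never guesses
  OpenIssues    []string   `json:"open_issues"`    // unresolved questions and pending work
  Preferences   []string   `json:"preferences"`    // stable user preferences (tone, format, stack)
  Decisions     []Decision `json:"decisions"`      // what was decided, and why
}

// Decision records a choice together with its rationale.
type Decision struct {
  What string `json:"what"`
  Why  string `json:"why"`
}

Because each update rewrites fields instead of appending prose, the new summary can be diffed against the previous one, which is what keeps drift in check.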

3) Cost caps: per-session budget + graceful degradation

Set a per-session budget (for example $0.50) and degrade in stages as it is consumed; a sketch follows the list:

  • at 70%: lower max_output_tokens
  • at 90%: disable expensive tools
  • at 100%: keep only summary + latest turn
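
A minimal sketch of that ladder, assuming spend is tracked per session; the type, function name, and any values not in the list above (such as the 512-token step) are illustrative.

// RequestLimits is what the request builder consults before each call.
type RequestLimits struct {
  MaxOutputTokens       int
  AllowExpensiveTools   bool
  SummaryPlusLatestOnly bool
}

// LimitsForSpend applies the 70% / 90% / 100% degradation steps cumulatively.
// Assumes capUSD > 0.
func LimitsForSpend(spentUSD, capUSD float64) RequestLimits {
  limits := RequestLimits{MaxOutputTokens: 1000, AllowExpensiveTools: true}
  ratio := spentUSD / capUSD

  if ratio >= 0.7 {
    limits.MaxOutputTokens = 512 // cheaper completions once 70% of the cap is spent
  }
  if ratio >= 0.9 {
    limits.AllowExpensiveTools = false // no expensive tool calls past 90%
  }
  if ratio >= 1.0 {
    limits.SummaryPlusLatestOnly = true // only the structured summary and the latest turn go out
  }
  return limits
}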

Go implementation notes for Responses API

  • isolate a dedicated “history builder” module
  • emit metrics per request: tokens, latency, estimated cost
  • version your summary schema (summary_schema=v2)
  • compress tool outputs before reinjection (see the capping sketch below)
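
For the last point, a hard character cap before reinjection is usually enough. The helper below is a sketch, not a fixed API: its name is an assumption, and the cut is byte-based, so multi-byte characters can be split at the boundary.

// CapToolOutput hard-caps raw tool output (logs, HTML, JSON dumps) before it
// is fed back into the conversation, keeping the head and tail of the text.
func CapToolOutput(out string, maxChars int) string {
  if len(out) <= maxChars {
    return out
  }
  head := maxChars / 2
  tail := maxChars - head
  return out[:head] + "\n...[tool output truncated]...\n" + out[len(out)-tail:]
}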

Troubleshooting checklist

  1. duplicate history append in multiple layers
  2. raw tool output (logs/HTML) injected unbounded
  3. summary drift (assumptions recorded as facts)
  4. token-only budget without price-aware control

Suggested starting values

  • max_context_tokens: 24k
  • summary_trigger_tokens: 16k
  • max_output_tokens: 1k
  • session_cost_cap_usd: 0.5
  • tool_output_hard_cap_chars: 4000
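
Keeping those values in one typed config makes the budget gates easy to audit; the struct below simply mirrors the list above, and its names are illustrative.

// SessionLimits gathers every budget knob so all layers read the same values.
type SessionLimits struct {
  MaxContextTokens       int
  SummaryTriggerTokens   int
  MaxOutputTokens        int
  SessionCostCapUSD      float64
  ToolOutputHardCapChars int
}

var DefaultSessionLimits = SessionLimits{
  MaxContextTokens:       24_000,
  SummaryTriggerTokens:   16_000,
  MaxOutputTokens:        1_000,
  SessionCostCapUSD:      0.5,
  ToolOutputHardCapChars: 4_000,
}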

Wrap-up

Context explosion is an engineering control problem.

If you implement truncation, structured summaries, and strict budget gates, latency and cost become predictable, and model-quality work finally becomes meaningful.