In production Go agents, the first thing that breaks is usually not model quality. It is memory management: context grows, bills spike, and answers drift.
Use a 3-layer memory design:
- L1: short-term conversational window (seconds)
- L2: rolling summary (minutes)
- L3: long-term retrieval memory (days)
1) Define hard rules: what stays in context vs. what goes to the index
Do not push the full chat history into every Responses call.
Use strict rules instead:
- Keep only the last N turns in L1 (usually 8-12 turns).
- Summarize old turns into L2 once a threshold is reached.
- Store durable facts in L3 (preferences, constraints, decisions).
- Enforce a per-request token budget and a per-session dollar cap.
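These rules are easier to audit when they live in one place. A sketch of them as a single config value, using the baseline defaults from the end of this post (MemoryRules is an illustrative name; the budget fields reappear as TokenBudget in the next section):

type MemoryRules struct {
    MaxL1Turns    int     // last N verbatim turns kept in L1
    SummaryTokens int     // target length for the L2 rolling summary
    TopK          int     // L3 items retrieved per request
    HardCapUSD    float64 // per-session spend limit
}

var DefaultRules = MemoryRules{
    MaxL1Turns:    10,
    SummaryTokens: 400, // middle of the 300-500 baseline
    TopK:          4,
    HardCapUSD:    1.0, // tune between $0.5 and $2 per business value
}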
2) Go data model for layered memory
type Message struct {
    Role      string    `json:"role"`
    Content   string    `json:"content"`
    Timestamp time.Time `json:"ts"`
}

type SessionMemory struct {
    SessionID      string
    ShortWindow    []Message // L1: verbatim recent turns
    RollingSummary string    // L2: compressed older history
    Budget         TokenBudget
}

type TokenBudget struct {
    MaxInputTokens int     // hard ceiling for assembled input
    ReserveOutput  int     // tokens held back for the model's reply
    HardCapUSD     float64 // per-session spend limit
}
Rule: budget first, prompt assembly second.
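The summarize-on-threshold rule from section 1 needs a trigger on this model. A minimal sketch, assuming a summarize callback backed by a cheap model call (CompactIfNeeded is an illustrative name, not part of the data model above):

// CompactIfNeeded folds turns beyond the L1 window into the L2 summary.
// The summarize callback is an LLM call; if it fails, keep the full
// window so no turns are lost, and retry on the next request.
func (m *SessionMemory) CompactIfNeeded(summarize func(prev string, old []Message) (string, error)) error {
    const maxL1Turns = 10
    if len(m.ShortWindow) <= maxL1Turns {
        return nil
    }
    old := m.ShortWindow[:len(m.ShortWindow)-maxL1Turns]
    updated, err := summarize(m.RollingSummary, old)
    if err != nil {
        return err
    }
    m.RollingSummary = updated
    m.ShortWindow = m.ShortWindow[len(m.ShortWindow)-maxL1Turns:]
    return nil
}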
3) Prompt assembly order (critical)
For each Responses request, build input in this order:
- System instructions
- L2 rolling summary
- L3 retrieved memory (Top-K with sources)
- L1 recent turns
- Current user message
Putting the stable layers first and the volatile recent turns last improves answer stability and keeps cost predictable.
4) Copy-paste context explosion control (Go)
// BuildInput assembles the request in the layered order from section 3.
func BuildInput(mem SessionMemory, retrieved []string, userInput string) []Message {
    var input []Message
    input = append(input, Message{Role: "system", Content: "You are a precise coding assistant."})
    if mem.RollingSummary != "" {
        input = append(input, Message{Role: "system", Content: "Session summary:\n" + mem.RollingSummary})
    }
    topK := min(4, len(retrieved)) // min is a builtin since Go 1.21
    if topK > 0 {
        input = append(input, Message{Role: "system", Content: "Retrieved memory:\n" + strings.Join(retrieved[:topK], "\n---\n")})
    }
    window := tail(mem.ShortWindow, 10)
    input = append(input, window...)
    input = append(input, Message{Role: "user", Content: userInput})
    return input
}

// tail returns the last n messages, or all of them if fewer exist.
func tail(msgs []Message, n int) []Message {
    if len(msgs) <= n {
        return msgs
    }
    return msgs[len(msgs)-n:]
}
5) Cost caps with two gates
Gate A: pre-request token estimation
If estimated_input + reserve_output > MaxInputTokens, reduce in this order:
- L1 window size
- L2 summary length
- L3 topK
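A minimal sketch of Gate A, assuming a crude chars/4 heuristic in place of a real tokenizer (EstimateTokens and FitBudget are illustrative names; the caller initializes topK to min(4, len(retrieved))):

// EstimateTokens is deliberately rough (~4 chars per token for English);
// swap in a real tokenizer when accuracy matters.
func EstimateTokens(msgs []Message) int {
    chars := 0
    for _, m := range msgs {
        chars += len(m.Content)
    }
    return chars / 4
}

// FitBudget trims in the documented order: L1 window, then L2 summary,
// then L3 topK, until the estimate fits the budget.
func FitBudget(mem *SessionMemory, topK *int, retrieved []string, userInput string) {
    for {
        input := BuildInput(*mem, retrieved[:*topK], userInput)
        if EstimateTokens(input)+mem.Budget.ReserveOutput <= mem.Budget.MaxInputTokens {
            return
        }
        switch {
        case len(mem.ShortWindow) > 4:
            mem.ShortWindow = mem.ShortWindow[2:] // drop the oldest L1 turns first
        case len(mem.RollingSummary) > 500:
            mem.RollingSummary = mem.RollingSummary[:500] // crude cut; re-summarize in production
        case *topK > 0:
            *topK-- // shrink L3 retrieval last
        default:
            return // nothing left to trim; surface an error upstream
        }
    }
}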
Gate B: session-level dollar cap
func GuardCost(sessionCostUSD, hardCapUSD float64) error {
    if sessionCostUSD >= hardCapUSD {
        return fmt.Errorf("cost cap reached: %.2f/%.2f", sessionCostUSD, hardCapUSD)
    }
    return nil
}
When capped, return an explicit message. Never fail silently.
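A hypothetical call site (sess and its CostUSD running total are assumptions, not part of the code above):

if err := GuardCost(sess.CostUSD, sess.Memory.Budget.HardCapUSD); err != nil {
    // Surface the cap as a visible assistant message, never a silent drop.
    return Message{
        Role:    "assistant",
        Content: "This session reached its budget cap. Start a new session or ask an operator to raise the limit.",
    }, err
}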
6) L3 memory: keep signal, drop noise
Store only:
- stable preferences,
- project-critical facts,
- historical decisions with rationale.
Do not store:
- casual chat,
- transient emotions,
- expired context.
Practical hygiene:
- down-rank items with no hits in 7 days,
- archive after 30 days,
- flag conflicting facts for review.
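A sketch of the first two hygiene rules, assuming each L3 item tracks its last retrieval hit (MemoryItem is an illustrative shape; conflict detection needs domain-specific logic and is omitted here):

type MemoryItem struct {
    Fact     string
    Score    float64   // retrieval rank weight
    LastHit  time.Time // updated whenever retrieval returns this item
    Archived bool
}

// ApplyHygiene down-ranks items idle for 7+ days and archives after 30.
func ApplyHygiene(items []MemoryItem, now time.Time) {
    for i := range items {
        idle := now.Sub(items[i].LastHit)
        switch {
        case idle > 30*24*time.Hour:
            items[i].Archived = true
        case idle > 7*24*time.Hour:
            items[i].Score *= 0.5 // down-rank stale facts
        }
    }
}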
7) Production metrics you need
Track at least:
- agent_input_tokens_p95
- agent_output_tokens_p95
- memory_retrieval_hit_rate
- summary_compression_ratio
- response_latency_p95
- cost_usd_per_session
If hit rate drops while input tokens rise, retrieval quality is likely degrading.
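One way to wire these up, sketched with the Prometheus Go client (variable names are illustrative; the buckets are starting guesses to tune against real traffic):

import "github.com/prometheus/client_golang/prometheus"

var (
    inputTokens = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "agent_input_tokens",
        Help:    "Input tokens per request; alert on the p95.",
        Buckets: prometheus.ExponentialBuckets(256, 2, 8), // 256 to ~32k
    })
    sessionCost = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "cost_usd_per_session",
        Help:    "Total spend per session in USD.",
        Buckets: []float64{0.1, 0.25, 0.5, 1, 2, 5},
    })
)

func init() { prometheus.MustRegister(inputTokens, sessionCost) }

// In the request path: inputTokens.Observe(float64(EstimateTokens(input)))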
8) Fast troubleshooting checklist
- Off-topic answers -> inspect L3 for stale facts.
- Cost spike -> verify L1 window and L2 length controls.
- Latency spike -> check accidental large topK.
- Repeated actions -> ensure summary captures completed steps.
Summary
Stable Go + OpenAI Responses agents are not about “remembering everything”. They are about layering, limits, and observability.
A solid default baseline:
- L1: last 10 turns
- L2: 300-500 token summary
- L3: topK=4
- Session hard cap: $0.5-$2, depending on business value
Start here, then tune with real traffic and metrics.