In production Go agents, the first thing that breaks is usually not model quality. It is memory management: context grows, bills spike, and answers drift.
Use a 3-layer memory design:
- L1: short-term conversational window (seconds)
- L2: rolling summary (minutes)
- L3: long-term retrieval memory (days)
1) Define hard rules: what stays in context vs. what goes to the index
Do not push the full chat history into every Responses call.
Use strict rules instead:
- Keep only the last N turns in L1 (usually 8-12 turns).
- Summarize old turns into L2 once a threshold is reached.
- Store durable facts in L3 (preferences, constraints, decisions).
- Enforce a per-request token budget and a per-session dollar cap.
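These rules are easier to audit when they live in one place. A sketch of them as a single config value, using the baseline defaults from the end of this post (MemoryRules is an illustrative name; the budget fields reappear as TokenBudget in the next section):

type MemoryRules struct {
    MaxL1Turns    int     // last N verbatim turns kept in L1
    SummaryTokens int     // target length for the L2 rolling summary
    TopK          int     // L3 items retrieved per request
    HardCapUSD    float64 // per-session spend limit
}

var DefaultRules = MemoryRules{
    MaxL1Turns:    10,
    SummaryTokens: 400, // middle of the 300-500 baseline
    TopK:          4,
    HardCapUSD:    1.0, // tune between $0.5 and $2 per business value
}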
2) Go data model for layered memory
type Message struct {
    Role      string    `json:"role"`
    Content   string    `json:"content"`
    Timestamp time.Time `json:"ts"`
}

type SessionMemory struct {
    SessionID      string
    ShortWindow    []Message // L1: verbatim recent turns
    RollingSummary string    // L2: compressed older history
    Budget         TokenBudget
}

type TokenBudget struct {
    MaxInputTokens int     // hard ceiling for assembled input
    ReserveOutput  int     // tokens held back for the model's reply
    HardCapUSD     float64 // per-session spend limit
}
Rule: budget first, prompt assembly second.
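The summarize-on-threshold rule from section 1 needs a trigger on this model. A minimal sketch, assuming a summarize callback backed by a cheap model call (CompactIfNeeded is an illustrative name, not part of the data model above):

// CompactIfNeeded folds turns beyond the L1 window into the L2 summary.
// The summarize callback is an LLM call; if it fails, keep the full
// window so no turns are lost, and retry on the next request.
func (m *SessionMemory) CompactIfNeeded(summarize func(prev string, old []Message) (string, error)) error {
    const maxL1Turns = 10
    if len(m.ShortWindow) <= maxL1Turns {
        return nil
    }
    old := m.ShortWindow[:len(m.ShortWindow)-maxL1Turns]
    updated, err := summarize(m.RollingSummary, old)
    if err != nil {
        return err
    }
    m.RollingSummary = updated
    m.ShortWindow = m.ShortWindow[len(m.ShortWindow)-maxL1Turns:]
    return nil
}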
3) Prompt assembly order (critical)
For each Responses request, build input in this order:
- System instructions
- L2 rolling summary
- L3 retrieved memory (Top-K with sources)
- L1 recent turns
- Current user message
Putting the stable layers first and the volatile recent turns last improves answer stability and keeps cost predictable.
4) Copy-paste context explosion control (Go)
// BuildInput assembles the request in the layered order from section 3.
func BuildInput(mem SessionMemory, retrieved []string, userInput string) []Message {
    var input []Message
    input = append(input, Message{Role: "system", Content: "You are a precise coding assistant."})
    if mem.RollingSummary != "" {
        input = append(input, Message{Role: "system", Content: "Session summary:\n" + mem.RollingSummary})
    }
    topK := min(4, len(retrieved)) // min is a builtin since Go 1.21
    if topK > 0 {
        input = append(input, Message{Role: "system", Content: "Retrieved memory:\n" + strings.Join(retrieved[:topK], "\n---\n")})
    }
    window := tail(mem.ShortWindow, 10)
    input = append(input, window...)
    input = append(input, Message{Role: "user", Content: userInput})
    return input
}

// tail returns the last n messages, or all of them if fewer exist.
func tail(msgs []Message, n int) []Message {
    if len(msgs) <= n {
        return msgs
    }
    return msgs[len(msgs)-n:]
}
5) Cost caps with two gates
Gate A: pre-request token estimation
If estimated_input + reserve_output > MaxInputTokens, reduce in this order:
- L1 window size
- L2 summary length
- L3 topK
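A minimal sketch of Gate A, assuming a crude chars/4 heuristic in place of a real tokenizer (EstimateTokens and FitBudget are illustrative names; the caller initializes topK to min(4, len(retrieved))):

// EstimateTokens is deliberately rough (~4 chars per token for English);
// swap in a real tokenizer when accuracy matters.
func EstimateTokens(msgs []Message) int {
    chars := 0
    for _, m := range msgs {
        chars += len(m.Content)
    }
    return chars / 4
}

// FitBudget trims in the documented order: L1 window, then L2 summary,
// then L3 topK, until the estimate fits the budget.
func FitBudget(mem *SessionMemory, topK *int, retrieved []string, userInput string) {
    for {
        input := BuildInput(*mem, retrieved[:*topK], userInput)
        if EstimateTokens(input)+mem.Budget.ReserveOutput <= mem.Budget.MaxInputTokens {
            return
        }
        switch {
        case len(mem.ShortWindow) > 4:
            mem.ShortWindow = mem.ShortWindow[2:] // drop the oldest L1 turns first
        case len(mem.RollingSummary) > 500:
            mem.RollingSummary = mem.RollingSummary[:500] // crude cut; re-summarize in production
        case *topK > 0:
            *topK-- // shrink L3 retrieval last
        default:
            return // nothing left to trim; surface an error upstream
        }
    }
}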
Gate B: session-level dollar cap
func GuardCost(sessionCostUSD, hardCapUSD float64) error {
    if sessionCostUSD >= hardCapUSD {
        return fmt.Errorf("cost cap reached: %.2f/%.2f", sessionCostUSD, hardCapUSD)
    }
    return nil
}
When capped, return an explicit message. Never fail silently.
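A hypothetical call site (sess and its CostUSD running total are assumptions, not part of the code above):

if err := GuardCost(sess.CostUSD, sess.Memory.Budget.HardCapUSD); err != nil {
    // Surface the cap as a visible assistant message, never a silent drop.
    return Message{
        Role:    "assistant",
        Content: "This session reached its budget cap. Start a new session or ask an operator to raise the limit.",
    }, err
}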
6) L3 memory: keep signal, drop noise
Store only:
- stable preferences,
- project-critical facts,
- historical decisions with rationale.
Do not store:
- casual chat,
- transient emotions,
- expired context.
Practical hygiene:
- down-rank items with no hits in 7 days,
- archive after 30 days,
- flag conflicting facts for review.
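A sketch of the first two hygiene rules, assuming each L3 item tracks its last retrieval hit (MemoryItem is an illustrative shape; conflict detection needs domain-specific logic and is omitted here):

type MemoryItem struct {
    Fact     string
    Score    float64   // retrieval rank weight
    LastHit  time.Time // updated whenever retrieval returns this item
    Archived bool
}

// ApplyHygiene down-ranks items idle for 7+ days and archives after 30.
func ApplyHygiene(items []MemoryItem, now time.Time) {
    for i := range items {
        idle := now.Sub(items[i].LastHit)
        switch {
        case idle > 30*24*time.Hour:
            items[i].Archived = true
        case idle > 7*24*time.Hour:
            items[i].Score *= 0.5 // down-rank stale facts
        }
    }
}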
7) Production metrics you need
Track at least:
- agent_input_tokens_p95
- agent_output_tokens_p95
- memory_retrieval_hit_rate
- summary_compression_ratio
- response_latency_p95
- cost_usd_per_session
If hit rate drops while input tokens rise, retrieval quality is likely degrading.
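One way to wire these up, sketched with the Prometheus Go client (variable names are illustrative; the buckets are starting guesses to tune against real traffic):

import "github.com/prometheus/client_golang/prometheus"

var (
    inputTokens = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "agent_input_tokens",
        Help:    "Input tokens per request; alert on the p95.",
        Buckets: prometheus.ExponentialBuckets(256, 2, 8), // 256 to ~32k
    })
    sessionCost = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "cost_usd_per_session",
        Help:    "Total spend per session in USD.",
        Buckets: []float64{0.1, 0.25, 0.5, 1, 2, 5},
    })
)

func init() { prometheus.MustRegister(inputTokens, sessionCost) }

// In the request path: inputTokens.Observe(float64(EstimateTokens(input)))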
8) Fast troubleshooting checklist
- Off-topic answers -> inspect L3 for stale facts.
- Cost spike -> verify L1 window and L2 length controls.
- Latency spike -> check accidental large topK.
- Repeated actions -> ensure summary captures completed steps.
Summary
Stable Go + OpenAI Responses agents are not about “remembering everything”. They are about layering, limits, and observability.
A solid default baseline:
- L1: last 10 turns
- L2: 300-500 token summary
- L3: topK=4
- Session hard cap: $0.5-$2, depending on business value
Start here, then tune with real traffic and metrics.