In production Go agents, the first thing that breaks is usually not model quality. It is memory management: context grows, bills spike, and answers drift.

Use a 3-layer memory design:

  • L1: short-term conversational window (horizon: seconds)
  • L2: rolling summary (horizon: minutes)
  • L3: long-term retrieval memory (horizon: days)

1) Define hard rules: what stays in context vs. what goes to the index

Do not push the full chat history into every Responses call.

Use strict rules instead:

  1. Keep only the last N turns in L1 (usually 8-12 turns).
  2. Summarize old turns into L2 once a threshold is reached.
  3. Store durable facts in L3 (preferences, constraints, decisions).
  4. Enforce a per-request token budget and a per-session dollar cap.

2) Go data model for layered memory

import "time"

// Message is one conversational turn.
type Message struct {
    Role      string    `json:"role"`
    Content   string    `json:"content"`
    Timestamp time.Time `json:"ts"`
}

// SessionMemory holds all three layers for one session.
type SessionMemory struct {
    SessionID      string
    ShortWindow    []Message // L1: last N raw turns
    RollingSummary string    // L2: compressed history
    Budget         TokenBudget
}

// TokenBudget caps a single request and the whole session.
type TokenBudget struct {
    MaxInputTokens int     // hard ceiling for assembled input tokens
    ReserveOutput  int     // tokens held back for the model's reply
    HardCapUSD     float64 // per-session dollar cap (Gate B)
}
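
With these types in place, rules 1-2 from section 1 reduce to a small eviction step. A sketch, assuming a summarize callback that wraps your summarization prompt:

// maybeCompact enforces rules 1-2: once L1 exceeds maxTurns, the overflow
// is folded into the L2 rolling summary and dropped from the window.
func maybeCompact(mem *SessionMemory, maxTurns int, summarize func(evicted []Message, prev string) string) {
    if len(mem.ShortWindow) <= maxTurns {
        return
    }
    cut := len(mem.ShortWindow) - maxTurns
    mem.RollingSummary = summarize(mem.ShortWindow[:cut], mem.RollingSummary)
    mem.ShortWindow = mem.ShortWindow[cut:]
}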

Rule: budget first, prompt assembly second.

3) Prompt assembly order (critical)

For each Responses request, build input in this order:

  1. System instructions
  2. L2 rolling summary
  3. L3 retrieved memory (Top-K with sources)
  4. L1 recent turns
  5. Current user message

Stable content goes first and volatile content last: a consistent prefix improves answer stability and benefits from prefix-based prompt caching, which lowers cost.

4) Context-explosion control in Go (copy-paste ready)

import "strings"

// BuildInput assembles the layered prompt in the order from section 3:
// system instructions, L2 summary, L3 retrieved memory, L1 turns, user message.
func BuildInput(mem SessionMemory, retrieved []string, userInput string) []Message {
    var input []Message

    input = append(input, Message{Role: "system", Content: "You are a precise coding assistant."})

    if mem.RollingSummary != "" {
        input = append(input, Message{Role: "system", Content: "Session summary:\n" + mem.RollingSummary})
    }

    topK := min(4, len(retrieved)) // min is a builtin since Go 1.21
    if topK > 0 {
        input = append(input, Message{Role: "system", Content: "Retrieved memory:\n" + strings.Join(retrieved[:topK], "\n---\n")})
    }

    input = append(input, tail(mem.ShortWindow, 10)...) // L1: last 10 turns
    input = append(input, Message{Role: "user", Content: userInput})

    return input
}

// tail returns the last n messages, or all of them if fewer than n exist.
func tail(msgs []Message, n int) []Message {
    if len(msgs) <= n {
        return msgs
    }
    return msgs[len(msgs)-n:]
}

5) Cost caps with two gates

Gate A: pre-request token estimation

If estimated_input + reserve_output > MaxInputTokens, reduce in this order (see the sketch after this list):

  1. L1 window size
  2. L2 summary length
  3. L3 topK
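
A minimal sketch of Gate A, assuming a crude ~4-characters-per-token estimate (swap in a real tokenizer for accuracy) and reducer callbacks that encode the L1 -> L2 -> L3 trim order:

// estimateTokens is a rough heuristic (~4 chars per token for English text).
func estimateTokens(msgs []Message) int {
    n := 0
    for _, m := range msgs {
        n += len(m.Content)/4 + 4 // +4 rough per-message overhead
    }
    return n
}

// FitBudget rebuilds the input after every cut. reducers must be ordered
// L1 window, L2 summary length, L3 topK; each returns false once exhausted.
func FitBudget(build func() []Message, reducers []func() bool, b TokenBudget) ([]Message, error) {
    input := build()
    i := 0
    for estimateTokens(input)+b.ReserveOutput > b.MaxInputTokens {
        if i >= len(reducers) {
            return nil, fmt.Errorf("input cannot fit under %d tokens", b.MaxInputTokens)
        }
        if !reducers[i]() {
            i++ // this layer is fully trimmed; move to the next one
            continue
        }
        input = build()
    }
    return input, nil
}

Each reducer mutates state that the build closure reads, for example shrinking the L1 window from 10 toward 4 one turn at a time.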

Gate B: session-level dollar cap

// GuardCost is Gate B: refuse new requests once the session hits its cap.
func GuardCost(sessionCostUSD, hardCapUSD float64) error {
    if sessionCostUSD >= hardCapUSD {
        return fmt.Errorf("cost cap reached: %.2f/%.2f", sessionCostUSD, hardCapUSD)
    }
    return nil
}
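
The session total itself is accrued from each response's token usage. A sketch with placeholder prices (pull real per-model rates from your pricing config):

// AddUsage accrues one request's cost into the session total.
// The per-1M-token rates below are placeholders, not real pricing.
func AddUsage(sessionCostUSD *float64, inputTokens, outputTokens int) {
    const inPer1M, outPer1M = 2.50, 10.00 // USD per 1M tokens (placeholder)
    *sessionCostUSD += float64(inputTokens)*inPer1M/1e6 + float64(outputTokens)*outPer1M/1e6
}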

When capped, return an explicit message. Never fail silently.

6) L3 memory: keep signal, drop noise

Store only:

  • stable preferences,
  • project-critical facts,
  • historical decisions with rationale.

Do not store:

  • casual chat,
  • transient emotions,
  • expired context.

Practical hygiene (a sketch follows the list):

  • down-rank items with no hits in 7 days,
  • archive after 30 days,
  • flag conflicting facts for review.
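
A minimal sketch of that hygiene pass, assuming a hypothetical MemoryItem record and interpreting "archive after 30 days" as 30 days without retrieval hits (conflict detection is domain-specific and omitted):

// MemoryItem is a hypothetical L3 record shape for this sketch.
type MemoryItem struct {
    Fact     string
    Score    float64
    LastHit  time.Time
    Archived bool
}

// ApplyHygiene down-ranks items idle for 7 days and archives after 30.
func ApplyHygiene(items []MemoryItem, now time.Time) {
    for i := range items {
        idle := now.Sub(items[i].LastHit)
        if idle > 7*24*time.Hour {
            items[i].Score *= 0.5 // down-rank: no retrieval hits in a week
        }
        if idle > 30*24*time.Hour {
            items[i].Archived = true // archive: a month without hits
        }
    }
}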

7) Production metrics you need

Track at least (a Prometheus sketch follows the list):

  • agent_input_tokens_p95
  • agent_output_tokens_p95
  • memory_retrieval_hit_rate
  • summary_compression_ratio
  • response_latency_p95
  • cost_usd_per_session
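
If you export to Prometheus, here are two of these series sketched with the official Go client; the metric names and bucket choices are assumptions, and the p95s come from histogram_quantile at query time:

import "github.com/prometheus/client_golang/prometheus"

var (
    inputTokens = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "agent_input_tokens",
        Help:    "Assembled input tokens per request.",
        Buckets: prometheus.ExponentialBuckets(128, 2, 10), // 128 .. ~65k
    })
    sessionCost = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "agent_session_cost_usd_total",
        Help: "Accumulated USD spend across sessions.",
    })
)

func init() {
    prometheus.MustRegister(inputTokens, sessionCost)
}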

If hit rate drops while input tokens rise, retrieval quality is likely degrading.

8) Fast troubleshooting checklist

  1. Off-topic answers -> inspect L3 for stale facts.
  2. Cost spike -> verify the L1 window and L2 length controls.
  3. Latency spike -> check for an accidentally large topK.
  4. Repeated actions -> ensure the summary captures completed steps.

Summary

Stable Go + OpenAI Responses agents are not about “remembering everything”. They are about layering, limits, and observability.

A solid default baseline (expressed as Go defaults after the list):

  • L1: last 10 turns
  • L2: 300-500 token summary
  • L3: topK=4
  • Session hard cap: $0.50-$2.00, depending on business value
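
The same baseline as Go defaults (MaxInputTokens and ReserveOutput are assumptions; derive them from your model's context window):

var DefaultBudget = TokenBudget{
    MaxInputTokens: 8000, // assumption: set from your model's context window
    ReserveOutput:  1000, // assumption: room for a typical reply
    HardCapUSD:     1.00, // midpoint of the $0.50-$2.00 range
}

const (
    DefaultL1Turns       = 10  // last N turns kept verbatim
    DefaultSummaryTokens = 500 // upper bound for the L2 rolling summary
    DefaultTopK          = 4   // L3 retrieval fan-in
)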

Start here, then tune with real traffic and metrics.