If your Go service's RSS keeps climbing, drops after a restart, then climbs again, you likely have a memory retention problem (or a genuine leak pattern).

Do not start with random code edits. Run a clean evidence chain: metrics trend check → pprof snapshots → flame graph comparison → object growth path → regression validation.

TL;DR

  • Trend beats point-in-time values.
  • Use heap + allocs + GC metrics together, or you will misdiagnose normal cache growth.
  • Start with go tool pprof; use the flame graph view to pinpoint the business call path.

1) Confirm it is likely a real leak pattern

Watch these three metrics first:

  1. process_resident_memory_bytes (RSS)
  2. go_memstats_heap_inuse_bytes
  3. go_memstats_heap_objects

Escalate to leak triage when:

  • QPS is stable (or lower)
  • RSS and heap objects still rise over time
  • Full GC does not bring memory down meaningfully
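
If these metrics land in Prometheus, a quick trend check looks like this (a sketch; the job label and the 30m window are placeholders for your setup):

# Sustained positive slope on both, under flat QPS, is the escalation signal.
deriv(go_memstats_heap_objects{job="my-service"}[30m]) > 0
deriv(process_resident_memory_bytes{job="my-service"}[30m]) > 0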

2) Expose pprof safely (internal only)

Standard setup

import _ "net/http/pprof"

func init() {
    go func() {
        // internal or localhost only
        _ = http.ListenAndServe("127.0.0.1:6060", nil)
    }()
}

If your service runs on Gin/Echo/Fiber, serve pprof from a dedicated debug port rather than from the public router (one way to do that is sketched below).
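
One way to wire that up: register the exported handlers from net/http/pprof on their own mux, so pprof never touches the framework's router (a sketch; startDebugServer is a name I am assuming):

import (
    "net/http"
    "net/http/pprof"
)

// startDebugServer serves pprof on a loopback-only port,
// separate from the framework's public listener.
func startDebugServer() {
    mux := http.NewServeMux()
    mux.HandleFunc("/debug/pprof/", pprof.Index) // also serves heap, allocs, goroutine, ...
    mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
    mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
    mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
    mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
    go func() {
        _ = http.ListenAndServe("127.0.0.1:6060", mux)
    }()
}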

Collect snapshots

curl -s http://127.0.0.1:6060/debug/pprof/heap > heap_1.pb.gz
sleep 300
curl -s http://127.0.0.1:6060/debug/pprof/heap > heap_2.pb.gz

curl -s http://127.0.0.1:6060/debug/pprof/allocs > allocs.pb.gz

Always collect at least two time-separated heap snapshots.
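
The on-call checklist at the end also asks for goroutine counts, so grab that profile in the same pass:

curl -s http://127.0.0.1:6060/debug/pprof/goroutine > goroutine.pb.gz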

3) Use pprof to find growth owners

go tool pprof -top heap_1.pb.gz
go tool pprof -top heap_2.pb.gz
go tool pprof -top -base heap_1.pb.gz heap_2.pb.gz
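
Heap profiles default to the inuse_space sample index. When you are chasing object counts rather than bytes, switch the index:

go tool pprof -sample_index=inuse_objects -top -base heap_1.pb.gz heap_2.pb.gz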

Focus on:

  • Biggest growth in inuse_space
  • Call paths that keep expanding
  • Object-heavy types (map, slice, string, buffers)

Common leak patterns:

  1. Global map that only grows
  2. Goroutine leak keeping references alive (see the sketch after this list)
  3. Cache without TTL/LRU limits
  4. Channel backlog retaining payloads
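
Pattern 2 deserves a sketch, because the leaked goroutine itself is tiny; it is the payload it still references that shows up in the heap profile (loadPayload is a hypothetical stand-in for a slow downstream fetch):

import "context"

// loadPayload stands in for a slow downstream fetch (hypothetical).
func loadPayload() []byte { return make([]byte, 1<<20) }

// Leaky: if the caller's context expires first, nobody ever receives
// from ch, the goroutine blocks on the send forever, and the payload
// stays reachable from the blocked goroutine's stack.
func fetch(ctx context.Context) ([]byte, error) {
    ch := make(chan []byte) // unbuffered: send blocks until received
    go func() {
        ch <- loadPayload()
    }()
    select {
    case b := <-ch:
        return b, nil
    case <-ctx.Done():
        return nil, ctx.Err() // goroutine + payload leak here
    }
}

The one-line fix is make(chan []byte, 1): the send then always completes, the goroutine exits, and the payload becomes collectible as soon as nothing else holds it.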

4) Flame graph for call-path attribution

go tool pprof -http=:8081 heap_2.pb.gz

In the pprof web UI's flame graph view:

  • Wider block = larger memory share
  • Drill down to business-level functions
  • Compare early vs later snapshots to find widening chains (delta command below)
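
For that comparison, load the delta straight into the web UI instead of eyeballing two tabs:

go tool pprof -http=:8081 -base heap_1.pb.gz heap_2.pb.gz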

A widening path like BuildResponse -> append -> json.Marshal often indicates oversized object assembly plus retained references.

5) Practical fix example

Problematic pattern

// Unbounded, process-lifetime cache: entries are added but never
// evicted, so every unique uid pins a *UserProfile forever.
// (It is also not safe for concurrent use without a lock.)
var userCache = map[string]*UserProfile{}

func GetUserProfile(uid string) *UserProfile {
    if v, ok := userCache[uid]; ok {
        return v
    }
    p := loadProfile(uid)
    userCache[uid] = p // write without delete: the map only grows
    return p
}

With high-cardinality user IDs, this grows forever.

Better approach

  • Add size limit (LRU)
  • Add expiration (TTL)
  • Add observability (size, hit rate, evictions)

Pseudo example:

cache := lru.NewWithTTL(20000, 10*time.Minute)
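
A runnable version of that pseudo example, using the expirable package from hashicorp/golang-lru/v2 as one possible backing store (the 20000 cap and 10-minute TTL are illustrative; loadProfile is the same loader as above):

import (
    "time"

    "github.com/hashicorp/golang-lru/v2/expirable"
)

// Bounded and self-expiring: at most 20000 entries, each evicted
// 10 minutes after insertion. Safe for concurrent use.
var userCache = expirable.NewLRU[string, *UserProfile](20000, nil, 10*time.Minute)

func GetUserProfile(uid string) *UserProfile {
    if v, ok := userCache.Get(uid); ok {
        return v
    }
    p := loadProfile(uid)
    userCache.Add(uid, p)
    return p
}

userCache.Len() gives you the size gauge for the observability bullet; hit rate and evictions need counters around Get and the eviction callback.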

6) Regression validation (no gut feeling)

After the fix:

  1. Run the same load profile for 30–60 minutes
  2. Compare heap_objects slope before/after
  3. Re-capture heap and run -base comparison (commands below)
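
Concretely, for step 3 (same debug port as before; take both snapshots under the post-fix load):

curl -s http://127.0.0.1:6060/debug/pprof/heap > heap_post_1.pb.gz
sleep 300
curl -s http://127.0.0.1:6060/debug/pprof/heap > heap_post_2.pb.gz
go tool pprof -top -base heap_post_1.pb.gz heap_post_2.pb.gz

A near-zero delta in the previously guilty call path is the pass signal.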

Healthy result:

  • Object count plateaus
  • heap_inuse drops after GC cycles
  • RSS fluctuates but no longer trends one-way up

7) On-call checklist

  • pprof endpoint is internal-only
  • at least two heap snapshots (>=5 min apart)
  • -base delta analysis completed
  • goroutine count checked
  • cache has limit + TTL
  • post-fix load regression completed

Summary

In most Go incidents, “memory leak” is not GC failure; it is retention caused by uncontrolled reference paths.

If you run the full chain (metrics → pprof → flame graph → fix → regression), you move from guessing to proof-driven incident handling, and that is what actually fixes production issues fast.