If your Go service's RSS keeps climbing, drops after a restart, then climbs again, you likely have a memory retention problem (or a genuine leak pattern).

Do not start with random code edits. Run a clean evidence chain: metrics trend check → pprof snapshots → flame graph comparison → object growth path → regression validation.

TL;DR

  • Trend beats point-in-time values.
  • Use heap + allocs + GC metrics together, or you will misdiagnose normal cache growth.
  • Start with go tool pprof; use the flame graph view to pinpoint the business call path.

1) Confirm it is likely a real leak pattern

Watch these three metrics first:

  1. process_resident_memory_bytes (RSS)
  2. go_memstats_heap_inuse_bytes
  3. go_memstats_heap_objects

Escalate to leak triage when:

  • QPS is stable (or lower)
  • RSS and heap objects still rise over time
  • Full GC does not bring memory down meaningfully
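
If these metrics land in Prometheus, a quick trend check looks like this (a sketch; the job label and the 30m window are placeholders for your setup):

# Sustained positive slope on both, under flat QPS, is the escalation signal.
deriv(go_memstats_heap_objects{job="my-service"}[30m]) > 0
deriv(process_resident_memory_bytes{job="my-service"}[30m]) > 0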

2) Expose pprof safely (internal only)

Standard setup

import _ "net/http/pprof"

func init() {
    go func() {
        // internal or localhost only
        _ = http.ListenAndServe("127.0.0.1:6060", nil)
    }()
}

If your service runs on Gin/Echo/Fiber, serve pprof from a dedicated debug port rather than from the public router (one way to do that is sketched below).
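
One way to wire that up: register the exported handlers from net/http/pprof on their own mux, so pprof never touches the framework's router (a sketch; startDebugServer is a name I am assuming):

import (
    "net/http"
    "net/http/pprof"
)

// startDebugServer serves pprof on a loopback-only port,
// separate from the framework's public listener.
func startDebugServer() {
    mux := http.NewServeMux()
    mux.HandleFunc("/debug/pprof/", pprof.Index) // also serves heap, allocs, goroutine, ...
    mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
    mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
    mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
    mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
    go func() {
        _ = http.ListenAndServe("127.0.0.1:6060", mux)
    }()
}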

Collect snapshots

curl -s http://127.0.0.1:6060/debug/pprof/heap > heap_1.pb.gz
sleep 300
curl -s http://127.0.0.1:6060/debug/pprof/heap > heap_2.pb.gz

curl -s http://127.0.0.1:6060/debug/pprof/allocs > allocs.pb.gz

Always collect at least two time-separated heap snapshots.
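
The on-call checklist at the end also asks for goroutine counts, so grab that profile in the same pass:

curl -s http://127.0.0.1:6060/debug/pprof/goroutine > goroutine.pb.gz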

3) Use pprof to find growth owners

go tool pprof -top heap_1.pb.gz
go tool pprof -top heap_2.pb.gz
go tool pprof -top -base heap_1.pb.gz heap_2.pb.gz
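
Heap profiles default to the inuse_space sample index. When you are chasing object counts rather than bytes, switch the index:

go tool pprof -sample_index=inuse_objects -top -base heap_1.pb.gz heap_2.pb.gz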

Focus on:

  • Biggest growth in inuse_space
  • Call paths that keep expanding
  • Object-heavy types (map, slice, string, buffers)

Common leak patterns:

  1. Global map that only grows
  2. Goroutine leak keeping references alive (see the sketch after this list)
  3. Cache without TTL/LRU limits
  4. Channel backlog retaining payloads
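
Pattern 2 deserves a sketch, because the leaked goroutine itself is tiny; it is the payload it still references that shows up in the heap profile (loadPayload is a hypothetical stand-in for a slow downstream fetch):

import "context"

// loadPayload stands in for a slow downstream fetch (hypothetical).
func loadPayload() []byte { return make([]byte, 1<<20) }

// Leaky: if the caller's context expires first, nobody ever receives
// from ch, the goroutine blocks on the send forever, and the payload
// stays reachable from the blocked goroutine's stack.
func fetch(ctx context.Context) ([]byte, error) {
    ch := make(chan []byte) // unbuffered: send blocks until received
    go func() {
        ch <- loadPayload()
    }()
    select {
    case b := <-ch:
        return b, nil
    case <-ctx.Done():
        return nil, ctx.Err() // goroutine + payload leak here
    }
}

The one-line fix is make(chan []byte, 1): the send then always completes, the goroutine exits, and the payload becomes collectible as soon as nothing else holds it.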

4) Flame graph for call-path attribution

go tool pprof -http=:8081 heap_2.pb.gz

In the pprof web UI's flame graph view:

  • Wider block = larger memory share
  • Drill down to business-level functions
  • Compare early vs later snapshots to find widening chains (delta command below)
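
For that comparison, load the delta straight into the web UI instead of eyeballing two tabs:

go tool pprof -http=:8081 -base heap_1.pb.gz heap_2.pb.gz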

A widening path like BuildResponse -> append -> json.Marshal often indicates oversized object assembly plus retained references.

5) Practical fix example

Problematic pattern

// Unbounded, process-lifetime cache: entries are added but never
// evicted, so every unique uid pins a *UserProfile forever.
// (It is also not safe for concurrent use without a lock.)
var userCache = map[string]*UserProfile{}

func GetUserProfile(uid string) *UserProfile {
    if v, ok := userCache[uid]; ok {
        return v
    }
    p := loadProfile(uid)
    userCache[uid] = p // write without delete: the map only grows
    return p
}

With high-cardinality user IDs, this grows forever.

Better approach

  • Add size limit (LRU)
  • Add expiration (TTL)
  • Add observability (size, hit rate, evictions)

Pseudo example:

cache := lru.NewWithTTL(20000, 10*time.Minute)
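
A runnable version of that pseudo example, using the expirable package from hashicorp/golang-lru/v2 as one possible backing store (the 20000 cap and 10-minute TTL are illustrative; loadProfile is the same loader as above):

import (
    "time"

    "github.com/hashicorp/golang-lru/v2/expirable"
)

// Bounded and self-expiring: at most 20000 entries, each evicted
// 10 minutes after insertion. Safe for concurrent use.
var userCache = expirable.NewLRU[string, *UserProfile](20000, nil, 10*time.Minute)

func GetUserProfile(uid string) *UserProfile {
    if v, ok := userCache.Get(uid); ok {
        return v
    }
    p := loadProfile(uid)
    userCache.Add(uid, p)
    return p
}

userCache.Len() gives you the size gauge for the observability bullet; hit rate and evictions need counters around Get and the eviction callback.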

6) Regression validation (no gut feeling)

After the fix:

  1. Run the same load profile for 30–60 minutes
  2. Compare heap_objects slope before/after
  3. Re-capture heap and run -base comparison (commands below)
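
Concretely, for step 3 (same debug port as before; take both snapshots under the post-fix load):

curl -s http://127.0.0.1:6060/debug/pprof/heap > heap_post_1.pb.gz
sleep 300
curl -s http://127.0.0.1:6060/debug/pprof/heap > heap_post_2.pb.gz
go tool pprof -top -base heap_post_1.pb.gz heap_post_2.pb.gz

A near-zero delta in the previously guilty call path is the pass signal.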

Healthy result:

  • Object count plateaus
  • heap_inuse drops after GC cycles
  • RSS fluctuates but no longer trends one-way up

7) On-call checklist

  • pprof endpoint is internal-only
  • at least two heap snapshots (>=5 min apart)
  • -base delta analysis completed
  • goroutine count checked
  • cache has limit + TTL
  • post-fix load regression completed

Summary

In most Go incidents, “memory leak” is not GC failure; it is retention caused by uncontrolled reference paths.

If you run the full chain (metrics → pprof → flame graph → fix → regression), you move from guessing to proof-driven incident handling, and that is what actually fixes production issues fast.