If your Go service's RSS keeps climbing, drops after a restart, then climbs again, you likely have a memory retention problem (or an actual leak pattern).
Do not start with random code edits. Run a clean evidence chain: metrics trend check → pprof snapshots → FlameGraph comparison → object growth path → regression validation.
TL;DR
- Trend beats point-in-time values.
- Use `heap` + `allocs` profiles together with GC metrics, or you will misdiagnose normal cache growth.
- Start with `go tool pprof`; use FlameGraph to pinpoint the business call path.
1) Confirm it is likely a real leak pattern
Watch these three metrics first:
- `process_resident_memory_bytes` (RSS)
- `go_memstats_heap_inuse_bytes`
- `go_memstats_heap_objects`
Escalate to leak triage when:
- QPS is stable (or lower)
- RSS and heap objects still rise over time
- Full GC does not bring memory down meaningfully
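If the service is not yet scraped by Prometheus, a minimal in-process sampler gives you the same trend signal. A sketch under that assumption; `logHeapTrend` is an illustrative helper, not a library function:

```go
import (
	"log"
	"runtime"
	"time"
)

// logHeapTrend samples heap stats on a fixed interval. Under flat load,
// a steadily rising heap_objects/heap_inuse is the leak signal above.
// Call it once from main: go logHeapTrend(30 * time.Second)
func logHeapTrend(interval time.Duration) {
	var m runtime.MemStats
	for {
		runtime.ReadMemStats(&m)
		log.Printf("heap_inuse=%d heap_objects=%d num_gc=%d",
			m.HeapInuse, m.HeapObjects, m.NumGC)
		time.Sleep(interval)
	}
}
```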
2) Expose pprof safely (internal only)
Standard setup
```go
import (
	"net/http"
	_ "net/http/pprof" // registers handlers on http.DefaultServeMux
)

func init() {
	go func() {
		// Internal or localhost only; never expose pprof publicly.
		_ = http.ListenAndServe("127.0.0.1:6060", nil)
	}()
}
```
Use a dedicated debug port if your service framework is Gin/Echo/Fiber.
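One way to do that is an explicit mux, so the pprof handlers never land on the framework's public router. A sketch; `startDebugServer` is an illustrative name:

```go
import (
	"net/http"
	"net/http/pprof"
)

// startDebugServer serves pprof from its own mux on an internal port,
// separate from the Gin/Echo/Fiber router handling real traffic.
func startDebugServer() {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.Handle("/debug/pprof/heap", pprof.Handler("heap"))
	mux.Handle("/debug/pprof/allocs", pprof.Handler("allocs"))
	mux.Handle("/debug/pprof/goroutine", pprof.Handler("goroutine"))
	go func() {
		_ = http.ListenAndServe("127.0.0.1:6060", mux)
	}()
}
```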
Collect snapshots
```bash
curl -s http://127.0.0.1:6060/debug/pprof/heap > heap_1.pb.gz
sleep 300
curl -s http://127.0.0.1:6060/debug/pprof/heap > heap_2.pb.gz
curl -s http://127.0.0.1:6060/debug/pprof/allocs > allocs.pb.gz
```
Always collect at least two time-separated heap snapshots.
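The goroutine count is also cheap to capture at the same time, and it matters for the leak patterns in the next step (goroutine leaks keep references alive):

```bash
# First line reads "goroutine profile: total N" -- watch N across snapshots
curl -s "http://127.0.0.1:6060/debug/pprof/goroutine?debug=1" | head -n 1
```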
3) Use pprof to find growth owners
```bash
go tool pprof -top heap_1.pb.gz
go tool pprof -top heap_2.pb.gz
go tool pprof -base heap_1.pb.gz heap_2.pb.gz
```
Focus on:
- Biggest growth in `inuse_space`
- Call paths that keep expanding
- Object-heavy types (`map`, `slice`, `string`, buffers)
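To make sure you are diffing retained bytes rather than allocation counts, you can pin the sample index explicitly:

```bash
# inuse_space = memory still held; alloc_space would show churn instead
go tool pprof -sample_index=inuse_space -top -base heap_1.pb.gz heap_2.pb.gz
```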
Common leak patterns:
- Global map that only grows
- Goroutine leak keeping references alive
- Cache without TTL/LRU limits
- Channel backlog retaining payloads
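The goroutine-leak variant deserves a concrete look, because it shows up as memory growth even when the cache code is clean. A simplified sketch, with `Result` and `expensiveLoad` as hypothetical stand-ins:

```go
import "context"

type Result struct{ payload []byte }

// expensiveLoad is a hypothetical stand-in for a costly fetch.
func expensiveLoad() *Result { return &Result{payload: make([]byte, 1<<20)} }

// Leaky: if the caller times out and never receives, this goroutine
// blocks on the send forever, pinning result (and its payload) in memory.
func fetch(ch chan *Result) {
	result := expensiveLoad()
	ch <- result
}

// Fixed: selecting on ctx.Done() lets an abandoned request's goroutine
// exit and release its references.
func fetchFixed(ctx context.Context, ch chan *Result) {
	result := expensiveLoad()
	select {
	case ch <- result:
	case <-ctx.Done():
	}
}
```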
4) FlameGraph for call-path attribution
```bash
go tool pprof -http=:8081 heap_2.pb.gz
```
In FlameGraph:
- Wider block = larger memory share
- Drill down to business-level functions
- Compare early vs later snapshots to find widening chains
A widening path like `BuildResponse -> append -> json.Marshal` often indicates oversized object assembly plus retained references.
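Instead of eyeballing two separate graphs, you can also feed the delta itself into the web UI, so widening chains show up directly:

```bash
# Flame graph of growth between snapshots, not of absolute usage
go tool pprof -http=:8081 -base heap_1.pb.gz heap_2.pb.gz
```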
5) Practical fix example
Problematic pattern
```go
var userCache = map[string]*UserProfile{}

func GetUserProfile(uid string) *UserProfile {
	if v, ok := userCache[uid]; ok {
		return v
	}
	p := loadProfile(uid)
	userCache[uid] = p // never evicted: the map only grows
	return p
}
```
With high-cardinality user IDs, this grows forever.
Better approach
- Add size limit (LRU)
- Add expiration (TTL)
- Add observability (size, hit rate, evictions)
Pseudo example:

```go
cache := lru.NewWithTTL(20000, 10*time.Minute)
```
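A more concrete sketch, assuming hashicorp/golang-lru's `expirable` package (one option among several) and reusing the `loadProfile` placeholder from above; exporting `userCache.Len()` as a gauge covers the observability point:

```go
import (
	"time"

	"github.com/hashicorp/golang-lru/v2/expirable"
)

type UserProfile struct{ Name string }

// Bounded and expiring: at most 20k entries, each valid for 10 minutes.
var userCache = expirable.NewLRU[string, *UserProfile](20000, nil, 10*time.Minute)

func GetUserProfile(uid string) *UserProfile {
	if v, ok := userCache.Get(uid); ok {
		return v
	}
	p := loadProfile(uid)
	userCache.Add(uid, p) // size/TTL eviction replaces unbounded growth
	return p
}

// loadProfile stands in for the real data source, as in the snippet above.
func loadProfile(uid string) *UserProfile { return &UserProfile{Name: uid} }
```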
6) Regression validation (no gut feeling)
After the fix:
- Run the same load profile for 30–60 minutes
- Compare the `heap_objects` slope before/after
- Re-capture heap and run a `-base` comparison
Healthy result:
- Object count plateaus
- `heap_inuse` drops after GC cycles
- RSS fluctuates but no longer trends one-way up
7) On-call checklist
- pprof endpoint is internal-only
- at least two heap snapshots (>=5 min apart)
- `-base` delta analysis completed
- goroutine count checked
- cache has limit + TTL
- post-fix load regression completed
Summary
In most Go incidents, “memory leak” is not GC failure; it is retention caused by uncontrolled reference paths.
If you run the full chain (metrics → pprof → FlameGraph → fix → regression), you move from guessing to proof-driven incident handling—and that is what actually fixes production issues fast.