Why Most Go Performance Advice Is Outdated (Go 1.25 Edition)
What Changed in Go Runtime, GC, and Compiler — Measured, Not Assumed

A lot of Go performance advice still circulating today was not born wrong.
It was born early.
Most of it emerged when the Go runtime was younger, the compiler less aggressive, and the garbage collector far more sensitive to allocation patterns. Over time, the runtime evolved — but the advice stayed frozen.
This article is not about replacing one list of rules with another.
It’s about verifying which intuitions still hold in modern Go, and which ones quietly stopped matching reality.
Instead of arguing, we’ll measure.
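Every number below comes from standard `go test` benchmarks. A typical invocation looks like this (the file names old.txt/new.txt are placeholders; benchstat is the comparison tool from golang.org/x/perf):

```shell
# Run every benchmark in the package; -benchmem adds the B/op and
# allocs/op columns used in the tables below.
# -count=10 collects repeated samples so benchstat can report variance.
go test -bench=. -benchmem -count=10 | tee new.txt

# Compare against a saved baseline with statistical confidence.
benchstat old.txt new.txt
```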
Allocation vs Lifetime: The Core Misunderstanding
The most persistent performance instinct in Go is the fear of heap allocations.
The intuition feels solid: stack allocations are cheap, heap allocations are expensive, garbage collection is costly. Therefore, avoid heap allocations.
That logic collapses once you separate allocation cost from object lifetime.
In modern Go, allocating an object on the heap is usually cheap. Keeping it alive is not.
Let’s start with the simplest possible benchmark: allocating short-lived heap objects versus doing nothing special at all.
Benchmark: Short-Lived Heap Allocation
package main

import "testing"

var sink int

func allocShortLived(n int) {
    s := 0
    for i := range n { // modern: range over int (Go 1.22+)
        x := new(int)
        *x = i
        s += *x
    }
    sink = s // store the result globally to prevent dead-code elimination
}

func BenchmarkShortLivedAlloc(b *testing.B) {
    b.ReportAllocs()
    for b.Loop() {
        allocShortLived(1024)
    }
}

func noAlloc(n int) {
    s := 0
    for i := range n {
        x := i
        s += x
    }
    sink = s
}

func BenchmarkShortLived_NoAlloc(b *testing.B) {
    b.ReportAllocs()
    for b.Loop() {
        noAlloc(1024)
    }
}
| Benchmark | ns/op (range) | B/op | allocs/op |
| --- | --- | --- | --- |
| ShortLivedAlloc (with new) | 278–282 | 0 | 0 |
| ShortLived_NoAlloc | 277–279 | 0 | 0 |
Despite using new(int), the benchmark reports 0 allocations per operation. This means the compiler was able to keep the allocated value on the stack (or eliminate the allocation entirely) because the pointer never escaped.
In modern Go, using pointers does not automatically imply heap allocation. Allocation location is a compiler decision based on escape analysis, not syntax.
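You can inspect the compiler's escape decisions directly. A minimal sketch (the function names noEscape and escapes are illustrative, not from the benchmark above) — build it with `go build -gcflags=-m` and the compiler prints a verdict for each allocation:

```go
package main

import "fmt"

// noEscape can keep its new(int) on the stack: the pointer never
// leaves the function, so escape analysis sees no reason to heap-allocate.
func noEscape(n int) int {
    x := new(int)
    *x = n
    return *x
}

// escapes must heap-allocate: the pointer is returned, so the object
// has to outlive the function's stack frame.
func escapes(n int) *int {
    x := new(int)
    *x = n
    return x
}

func main() {
    // With `go build -gcflags=-m` you should see lines like
    // "new(int) does not escape" vs "new(int) escapes to heap".
    fmt.Println(noEscape(41), *escapes(42))
}
```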
Why Preallocation Became a Cargo Cult
Preallocating slices is one of the most common “optimizations” people apply reflexively.
The reasoning is simple: slice growth reallocates memory, reallocation is expensive, therefore we should always preallocate.
That reasoning breaks down when the preallocation guess is wrong — which is most of the time in real systems.
Let’s compare three cases: no preallocation, exact preallocation, and aggressive over-preallocation.
Benchmark: Slice Growth Strategies
package main

import "testing"

const sliceN = 256

// Sinks prevent the compiler from eliminating work.
var sinkSlice []int
var sinkInt int

func buildNoPrealloc(n int) []int {
    var out []int
    for i := range n {
        out = append(out, i)
    }
    return out
}

func buildExactPrealloc(n int) []int {
    out := make([]int, 0, n)
    for i := range n {
        out = append(out, i)
    }
    return out
}

func buildOverPrealloc(n int) []int {
    // Intentionally over-allocating to simulate a common "just in case" optimization.
    out := make([]int, 0, n*16)
    for i := range n {
        out = append(out, i)
    }
    return out
}

func BenchmarkSlices_NoPrealloc(b *testing.B) {
    b.ReportAllocs()
    for b.Loop() {
        s := buildNoPrealloc(sliceN)
        // Touch the result so it can't be optimized away.
        sinkInt += len(s)
        sinkSlice = s
    }
}

func BenchmarkSlices_ExactPrealloc(b *testing.B) {
    b.ReportAllocs()
    for b.Loop() {
        s := buildExactPrealloc(sliceN)
        sinkInt += len(s)
        sinkSlice = s
    }
}

func BenchmarkSlices_OverPrealloc(b *testing.B) {
    b.ReportAllocs()
    for b.Loop() {
        s := buildOverPrealloc(sliceN)
        sinkInt += len(s)
        sinkSlice = s
    }
}
| Benchmark | ns/op (≈ typical) | B/op | allocs/op |
| --- | --- | --- | --- |
| Slices_NoPrealloc (n=256) | ~1050 | 4088 | 9 |
| Slices_ExactPrealloc | ~410 | 2048 | 1 |
| Slices_OverPrealloc (×16) | ~4500 | 32768 | 1 |
Exact preallocation is the “good” case here: it cuts allocations from 9 to 1, halves B/op, and reduces runtime from ~1050 ns/op to ~410 ns/op. No prealloc triggers slice growth and multiple reallocations (9 allocs/op), which is exactly the kind of overhead preallocation is meant to avoid.
The interesting part is over-preallocation. It still performs only 1 allocation, but it allocates far more memory (32 KB/op) and becomes significantly slower (~4.5 µs/op typical). This doesn’t mean “over-prealloc is always bad” — it means that allocating far more capacity than you’ll use can increase memory traffic and hurt cache behavior, even when allocation count looks great.
In modern Go, allocation count alone is a poor proxy for performance — bytes allocated and object lifetime often matter more.
Interfaces: The Optimization That Rarely Pays
Another long-standing belief is that interface calls are inherently slow and should be avoided in performance-sensitive code.
This belief comes from a time when interface dispatch blocked inlining and added measurable overhead.
Modern compilers are far more capable.
Let’s measure the difference between concrete calls, interface calls, and generic dispatch.
Benchmark: Interface vs Concrete vs Generic
package main

import "testing"

type Adder interface {
    Add(int) int
}

type impl struct {
    base int
}

func (i impl) Add(x int) int {
    return i.base + x
}

func callConcrete(v impl, n int) int {
    sum := 0
    for i := range n {
        sum += v.Add(i)
    }
    return sum
}

func callInterface(v Adder, n int) int {
    sum := 0
    for i := range n {
        sum += v.Add(i)
    }
    return sum
}

func callGeneric[T interface{ Add(int) int }](v T, n int) int {
    sum := 0
    for i := range n {
        sum += v.Add(i)
    }
    return sum
}

func BenchmarkConcrete(b *testing.B) {
    b.ReportAllocs()
    v := impl{base: 10}
    for b.Loop() {
        _ = callConcrete(v, 1024)
    }
}

func BenchmarkInterface(b *testing.B) {
    b.ReportAllocs()
    v := impl{base: 10}
    for b.Loop() {
        _ = callInterface(v, 1024)
    }
}

func BenchmarkGeneric(b *testing.B) {
    b.ReportAllocs()
    v := impl{base: 10}
    for b.Loop() {
        _ = callGeneric(v, 1024)
    }
}
| Benchmark | ns/op (≈ typical) | B/op | allocs/op |
| --- | --- | --- | --- |
| Concrete call | ~278 | 0 | 0 |
| Interface call | ~1645 | 0 | 0 |
| Generic call | ~1645 | 0 | 0 |
This benchmark isolates dispatch overhead in a tight loop — no allocations, no I/O, no cache-heavy data structures. In this artificial setting, interface and generic dispatch are ~6× slower than a direct concrete call (~1.6 µs vs ~0.28 µs per 1024 iterations), while still producing 0 allocs/op.
The point is not “interfaces are free” — they aren’t. The point is that this overhead is purely call-level and often disappears inside real workloads where time is dominated by memory access, synchronization, syscalls, or network I/O.
So the modern rule is: interfaces can cost something, but optimizing them is rarely the first place you get meaningful wins — measure before reshaping your APIs.
Note: this generic benchmark uses a method constraint, which may still compile to indirect calls depending on how the compiler specializes the instantiation.
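To see why the overhead fades in practice, give each call real work. A hedged sketch (Hasher, fnvHash, and the sum helpers are illustrative names, not part of the benchmark above): both call shapes compute the same result, and the per-call hashing arithmetic dwarfs the small fixed cost of an indirect call.

```go
package main

import "fmt"

// Hasher is the interface-dispatched version of the same operation.
type Hasher interface{ Hash([]byte) uint64 }

type fnvHash struct{}

// Hash is FNV-1a: enough real arithmetic per call that dispatch cost
// becomes noise rather than the dominant term.
func (fnvHash) Hash(b []byte) uint64 {
    h := uint64(14695981039346656037)
    for _, c := range b {
        h ^= uint64(c)
        h *= 1099511628211
    }
    return h
}

func sumConcrete(h fnvHash, data [][]byte) uint64 {
    var s uint64
    for _, b := range data {
        s += h.Hash(b) // direct call, inlinable
    }
    return s
}

func sumInterface(h Hasher, data [][]byte) uint64 {
    var s uint64
    for _, b := range data {
        s += h.Hash(b) // indirect call through the itable
    }
    return s
}

func main() {
    data := [][]byte{[]byte("alpha"), []byte("beta"), []byte("gamma")}
    // Same result either way; only a profiler can tell you whether
    // the dispatch difference matters in your workload.
    fmt.Println(sumConcrete(fnvHash{}, data) == sumInterface(fnvHash{}, data))
}
```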
sync.Pool: When the Cure Becomes the Disease
sync.Pool is often introduced as a way to “reduce GC pressure”.
That framing is misleading.
sync.Pool is designed to improve throughput by opportunistically reusing short-lived objects. It is explicitly allowed to drop its contents at any time.
Let’s compare direct allocation with pooled reuse.
Benchmark: Allocation vs Pool Reuse
package main

import (
    "sync"
    "testing"
)

var bufPool = sync.Pool{
    New: func() any {
        b := make([]byte, 32*1024)
        return &b
    },
}

func allocBuffers(n int) {
    for i := range n {
        b := make([]byte, 32*1024)
        b[0] = byte(i)
    }
}

func poolBuffers(n int) {
    for i := range n {
        p := bufPool.Get().(*[]byte)
        b := *p
        b[0] = byte(i)
        bufPool.Put(p)
    }
}

func BenchmarkAlloc(b *testing.B) {
    b.ReportAllocs()
    for b.Loop() {
        allocBuffers(128)
    }
}

func BenchmarkPool(b *testing.B) {
    b.ReportAllocs()
    for b.Loop() {
        poolBuffers(128)
    }
}
| Benchmark | ns/op (≈ typical) | B/op | allocs/op |
| --- | --- | --- | --- |
| Alloc (make) | ~41 | 0 | 0 |
| Pool (Get/Put) | ~1700 | 0 | 0 |
In this benchmark, sync.Pool is dramatically slower than plain allocation: ~1.7 µs/op versus ~41 ns/op. The key detail is that both variants report 0 allocs/op, which means the compiler was able to eliminate the allocation work in the “Alloc” case (and the pool path is mostly measuring Get/Put synchronization and bookkeeping overhead).
This is a good reminder that sync.Pool is not a universal “make things faster” switch. In microbenchmarks where allocations don’t actually hit the heap, a pool can easily be pure overhead.
This benchmark intentionally represents a case where allocations do not escape to the heap, making it a worst-case scenario for sync.Pool.
Retention: The Cost That Actually Hurts
The most expensive performance bugs in modern Go are rarely about how fast memory is allocated.
They are about how long memory stays reachable.
Let’s compare two patterns: retaining large payloads vs extracting only what’s needed.
Benchmark: Memory Retention
package main

import "testing"

var sink2 [][]byte

func badRetention(n int) [][]byte {
    out := make([][]byte, 0, n)
    for range n {
        b := make([]byte, 64*1024)
        out = append(out, b) // retains the full 64 KB payload
    }
    return out
}

func goodRetention(n int) [][]byte {
    out := make([][]byte, 0, n)
    for range n {
        b := make([]byte, 64*1024)
        out = append(out, append([]byte(nil), b[:64]...)) // copy only the 64 bytes we need
    }
    return out
}

func BenchmarkBadRetention(b *testing.B) {
    b.ReportAllocs()
    for b.Loop() {
        sink2 = badRetention(128)
    }
}

func BenchmarkGoodRetention(b *testing.B) {
    b.ReportAllocs()
    for b.Loop() {
        sink2 = goodRetention(128)
    }
}
| Benchmark | time/op (≈ typical) | B/op | allocs/op |
| --- | --- | --- | --- |
| BadRetention | ~1.5 ms | ~8.0 MB | 129 |
| GoodRetention | ~90 µs | ~11 KB | 129 |
Both benchmarks perform the same number of allocations (129 allocs/op). The difference is not how many objects are allocated, but how much memory they retain.
In the “bad retention” case, each iteration keeps references to large payloads, resulting in ~8 MB of live memory per operation and a runtime of ~1.5 ms/op. In the “good retention” case, the same number of allocations is performed, but only a small slice of each payload is retained, reducing live memory to ~11 KB and execution time to ~90 µs/op.
This is a ~16× difference in runtime with identical allocation counts.
This benchmark demonstrates why modern Go performance issues are rarely about allocation count. Both cases allocate the same number of objects, yet their performance differs by an order of magnitude. What matters is retention: how much memory stays reachable and for how long.
Reducing allocs/op without controlling object lifetime often optimizes the wrong thing.
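The same retention trap appears with substrings and subslices, because they share the original backing array. A small sketch (firstFieldRetains and firstFieldClones are illustrative names) using strings.Clone, which has been in the standard library since Go 1.18:

```go
package main

import (
    "fmt"
    "strings"
)

// firstFieldRetains returns the first comma-separated field of a record,
// but the returned string shares the record's backing array — keeping
// 5 bytes reachable keeps the whole record alive.
func firstFieldRetains(record string) string {
    i := strings.IndexByte(record, ',')
    return record[:i]
}

// firstFieldClones copies just those bytes, so the large record can be
// collected as soon as the caller drops it.
func firstFieldClones(record string) string {
    i := strings.IndexByte(record, ',')
    return strings.Clone(record[:i])
}

func main() {
    record := "alice," + strings.Repeat("x", 1<<20) // ~1 MB payload
    a := firstFieldRetains(record)
    b := firstFieldClones(record)
    fmt.Println(a == b, b) // same content, very different retention
}
```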
The Real Shift in Modern Go Performance
What changed by Go 1.25 is not a single feature or trick. The bigger shift is that many mechanical costs got cheaper, so architectural costs now dominate the profile.
Modern Go rewards designs with clear ownership and short-lived data. When lifetimes are explicit and concurrency is bounded, the runtime has much less “mess” to manage, and optimizations become predictable.
Old advice often assumed the runtime was fragile. In modern Go, the runtime is usually fine — it’s unclear lifetimes and accidental retention that break performance.
Closing Thought
If you optimize before measuring, you are probably optimizing a version of Go you are no longer running.
Let the runtime handle mechanics.
Your responsibility is to design systems whose data does not outlive its usefulness.


