Why Batch Size Changes Outputs

April 2026 – Vladislav Kruglikov

Send the same prompt to vLLM twice: once on its own, once batched alongside a few unrelated requests. The logits come back slightly different. Same weights, same input, same seed — different bits.

This is not a bug. It follows from three things about how GPUs do math.

Different batch sizes dispatch to different kernels. Under the hood, a matmul isn't one operation. It's a family of kernels, and libraries like cuBLAS pick one based on the input shape. A 128 × 128 matmul and a 4096 × 4096 matmul can run different code, tuned with different tile sizes: with tiles that are too large, small matrices waste work on padding; with tiles that are too small, large matrices reuse less data per tile and pay more scheduling overhead. Change the batch size and you change the shape; change the shape and you may change the kernel.
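To make the dispatch idea concrete, here is a toy sketch of shape-based selection. The function pick_tile, its thresholds, and the hidden width of 4096 are all hypothetical, chosen only for illustration; this is not how cuBLAS actually chooses kernels.

```python
def pick_tile(rows: int, cols: int) -> int:
    """Hypothetical heuristic: choose a square tile width from the matrix shape."""
    smallest = min(rows, cols)
    if smallest <= 128:
        return 32    # small problem: small tiles, less padding
    if smallest <= 1024:
        return 64
    return 128       # large problem: big tiles, more reuse per tile


hidden = 4096  # assumed model width, for illustration only

# Batching changes the first dimension of the activation matrix,
# which in this toy model is enough to change the selected tile.
for batch_tokens in (1, 64, 512, 4096):
    print(batch_tokens, pick_tile(batch_tokens, hidden))
```

The point is only that the request's own data never changes, while the shape the library sees does.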

Different kernels sum floats in different orders. A matmul is, at its core, a lot of summing. GPUs do those sums in parallel — each tile accumulates its partial sum locally, and then those partials get combined. A kernel using 128-wide tiles produces a different tree of partial sums than one using 512-wide tiles. Same numbers going in, different order of additions.
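A minimal sketch of that effect in plain Python, under simplifying assumptions: tiled_sum is a hypothetical helper that sums each tile sequentially and then adds the partials in order, whereas real GPU kernels use parallel tree reductions and lower-precision inputs. The values are picked so the rounding is visible in float64.

```python
def tiled_sum(values, tile):
    """Sum `values` tile by tile, then combine the per-tile partial sums."""
    partials = []
    for start in range(0, len(values), tile):
        acc = 0.0
        for v in values[start:start + tile]:
            acc += v              # sequential accumulation inside one tile
        partials.append(acc)

    total = 0.0
    for p in partials:            # combine the partials in tile order
        total += p
    return total


values = [1e16, 1.0, 1.0, 1.0]

narrow = tiled_sum(values, 2)     # tiles: [1e16, 1.0] and [1.0, 1.0]
wide = tiled_sum(values, 4)       # a single tile holding all four values

print(narrow, wide, narrow == wide)
```

With the narrow tiles, two of the 1.0s are added to each other before they meet 1e16, so their sum survives; with the single wide tile, each 1.0 is absorbed on its own.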

Floating-point addition isn't associative:

>>> (1e16 + 1.0) + 1.0
1e+16
>>> 1e16 + (1.0 + 1.0)
1.0000000000000002e+16

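The first 1.0 gets rounded away when it's added to 1e16: doubles near 1e16 are spaced 2 apart, so 1e16 + 1 falls exactly between two representable values and rounds back to 1e16. The second 1.0 meets the same fate. Add the two 1.0s to each other first and their sum, 2.0, is exactly one step up from 1e16, so it survives.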
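The rounding comes from the gap between adjacent doubles at that magnitude, which can be checked directly. A small sketch, assuming Python 3.9 or newer (where math.ulp is available):

```python
import math

gap = math.ulp(1e16)           # distance from 1e16 to the next larger double
print(gap)

absorbed = (1e16 + 1.0) == 1e16    # half a gap: rounds back to 1e16
survives = (1e16 + 2.0) != 1e16    # a full gap: lands on the next double
print(absorbed, survives)
```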

Why this is fine. The error is bounded and small: typically a few ULP per reduction, growing sub-linearly with the number of operations. In practice it sits below the randomness that temperature sampling already introduces. Inference servers make the obvious tradeoff: keep the faster, shape-specialized kernels and accept that outputs are reproducible up to noise, not up to bits.
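One rough way to get a feel for the size of the error is to sum the same numbers in several orders and compare each result against a correctly rounded reference. This is only a sketch under stated assumptions: it runs in float64 on the CPU, the input size, value range, seed, and number of orderings are arbitrary, and GPU inference commonly uses lower precision, so the absolute numbers there will be larger.

```python
import math
import random

random.seed(0)  # arbitrary seed, only so the run is repeatable
values = [random.uniform(0.0, 1.0) for _ in range(10_000)]

exact = math.fsum(values)  # correctly rounded sum, independent of order

worst = 0.0
for _ in range(20):
    random.shuffle(values)         # a different addition order each time
    naive = 0.0
    for v in values:
        naive += v                 # plain left-to-right accumulation
    worst = max(worst, abs(naive - exact) / exact)

print(f"worst relative error over 20 orderings: {worst:.2e}")
```

The orderings may disagree in their last bits, but every one of them stays very close to the reference, which is the sense in which the result is reproducible up to noise.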