Defeating non-determinism in LLM inference

NABLAS — Oct 27, 2025

About This Presentation

This document investigates the causes of non-determinism in large language models (LLMs) and introduces techniques for eliminating it, ensuring reproducible (deterministic) LLM inference.



Slide Content

Defeating non-determinism in LLM inference
Horace He, in collaboration with others at Thinking Machines
https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/

Introduction

●What is non-determinism?
○Non-determinism refers to a situation where the outcome cannot be precisely predicted even from the same initial conditions, because multiple results are possible.
●When we use LLMs in cases that require reproducible results - the same response to exactly the same query - the responses usually differ.
●Getting a result from an LLM involves "sampling": the model's output is converted into a probability distribution and a token is chosen according to its probability (see the sketch below).
●Even when the temperature parameter is set to zero at inference time, LLMs are still not deterministic.
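A minimal sketch of what temperature-0 "sampling" means in practice; the function and the numbers are illustrative and not part of the original slides. At temperature 0 the sampler degenerates to a greedy argmax, which is why the output should, in principle, be reproducible.

```python
# Minimal sketch (illustrative values): temperature-0 sampling is a greedy argmax.
import numpy as np

def sample_token(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    """Pick the next token id from raw logits."""
    if temperature == 0.0:
        # Greedy decoding: always take the most likely token.
        return int(np.argmax(logits))
    # Otherwise rescale the logits and sample from the softmax distribution.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = np.array([1.0, 3.2, 0.5, 2.9])
print(sample_token(logits, temperature=0.0, rng=rng))  # always 1 (the argmax)
```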

So why does non-determinism occur in LLMs?

Hypothesis: Floating point non-associativity

●For floating point numbers A, B and C:
○A + (B + C) ≠ (A + B) + C (demonstrated below)
●In a machine, a floating point number is represented by two fields - a mantissa and an exponent:
○N = mantissa × 10^exponent
●When adding two floats with different exponents, the result must be rounded to keep a fixed precision, so information is lost.
●Floating point operations, especially on GPUs, exhibit non-associativity because of
○finite precision and rounding errors.
●Nondeterminism is usually attributed to GPU kernels that use parallel "atomic add" operations for various neural network operations; the order in which these atomic adds complete is nondeterministic, so with non-associative floats the result can change from run to run.
●However, the forward pass of an LLM involves no such atomic operations, hence we can assume that the forward pass for LLM inference is run-to-run deterministic.
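A small, self-contained demonstration of the non-associativity above; the specific values are chosen only to make the rounding loss visible.

```python
# Floating point addition is not associative: the same three numbers summed
# in a different order can give different results.
a, b, c = 0.1, 1e20, -1e20

left = (a + b) + c   # 0.1 is rounded away when added to 1e20 first
right = a + (b + c)  # b and c cancel exactly, so 0.1 survives

print(left, right)    # 0.0 0.1
print(left == right)  # False
```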

Hypothesis: Batch size

●What is "batch-invariance"?
○When the batch size changes, the result for any given element of the batch should stay the same.
●Conversely, a "non-batch-invariant" kernel gives different results for the same element depending on the batch size.
●Difference between determinism and batch-invariance
○Determinism: running the operation twice with exactly the same inputs (including the same batch size) gives the same result.
○Batch-invariance: the result for a given input does not depend on the batch size it is processed with (see the check sketched below).
●An operation can be deterministic yet non-batch-invariant.
●When a non-batch-invariant kernel is used as part of a larger inference system, the system as a whole can become nondeterministic.
●When a request is made to an LLM inference endpoint, the load on the server changes the batch size used by the kernels.
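A sketch of the kind of batch-invariance check the source blog post describes for a matmul: compute the same row once with batch size 1 and once inside a larger batch, then compare. It assumes PyTorch and a CUDA GPU; the shapes and values are illustrative.

```python
# Batch-invariance check for matmul (assumes PyTorch + a CUDA GPU).
import torch

torch.set_default_device("cuda")
B, D = 2048, 4096
A = torch.linspace(-1000, 1000, B * D).reshape(B, D)
W = torch.linspace(-1000, 1000, D * D).reshape(D, D)

row_alone    = torch.mm(A[:1], W)   # the first row, processed as a batch of 1
row_in_batch = torch.mm(A, W)[:1]   # the same row, processed inside the full batch

# Each call is individually deterministic (rerunning gives identical bits),
# but the two results may differ: the kernel is deterministic yet not batch-invariant.
print((row_alone - row_in_batch).abs().max())
```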

●Non-batch-invariant kernels + nondeterministic server load = nondeterministic system
●Hence, the primary reason nearly all LLM inference endpoints are nondeterministic is that the load (i.e. the batch size) varies nondeterministically.
●This is not unique to GPUs - the same effect is also observed on CPUs and TPUs.

How to make kernels batch-invariant?

●To make an architecture batch-invariant, every kernel in it needs to be made batch-invariant.
●It is safe to assume that all pointwise operations in the model are already batch-invariant.
○Pointwise operations: element-wise functions such as addition, multiplication, and division.
●Operations that involve a reduction are usually non-batch-invariant (see the toy illustration below).
○Examples: RMSNorm, matrix multiplication, and the attention layers in transformers.

| Kernel    | Standard strategy (large batch)                                     | Standard strategy (small batch)                                                                  | Batch invariance status |
|-----------|---------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|-------------------------|
| RMSNorm   | Data-parallel: one core cluster per batch element (row).            | Split-reduction: multiple cores collaborate on a single batch element.                             | Violated                |
| Attention | Split work along the reduction (KV) dimension based on input size.  | Split work along the reduction (KV) dimension based on input size (the strategy changes with size). | Violated                |
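A toy numpy illustration (not an actual GPU kernel) of why the two strategies in the table can disagree: a single sequential reduction and a split reduction over chunks add the same numbers in different orders, so the rounded results need not match bit-for-bit.

```python
import numpy as np

x = np.random.default_rng(0).standard_normal(4096).astype(np.float32)

# "Large batch" strategy: one core handles the whole row sequentially.
full_sum = np.float32(0.0)
for v in x:
    full_sum += v

# "Small batch" strategy: several cores each reduce a chunk, then the
# partial sums are combined at the end -- a different summation order.
partials = [x[i:i + 512].sum(dtype=np.float32) for i in range(0, x.size, 512)]
split_sum = np.float32(0.0)
for p in partials:
    split_sum += p

print(full_sum, split_sum, full_sum == split_sum)  # often unequal in the last bits
```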

How to make kernels batch-invariant?

●For RMSNorm and similar matmul-style reductions:
○Instead of an adaptive strategy, use a single, consistent reduction strategy across all batch sizes (sketched below).
○A small batch means the kernel is likely to execute quickly anyway, so the slowdown is not catastrophic.
○Such a reduction strategy leads to an excess amount of parallelism at larger batch sizes, but it achieves decent (though not peak) performance across the entire range of sizes.
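A hedged sketch of the "single consistent strategy" idea, written as plain numpy rather than an actual vLLM/Triton kernel: every row is always reduced in the same fixed order, so the result for a row cannot depend on how many other rows are in the batch.

```python
import numpy as np

def rms_norm_batch_invariant(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm-like reduction with one fixed per-row reduction order."""
    out = np.empty_like(x)
    for i in range(x.shape[0]):      # conceptually: one core cluster per row
        acc = np.float32(0.0)
        for v in x[i]:               # fixed sequential reduction order
            acc += v * v
        out[i] = x[i] / np.sqrt(acc / np.float32(x.shape[1]) + np.float32(eps))
    return out

x = np.random.default_rng(0).standard_normal((4, 1024)).astype(np.float32)
# The first row's result is bit-identical whether it is normalized alone
# or as part of the batch of 4 -- the definition of batch invariance.
assert np.array_equal(rms_norm_batch_invariant(x[:1])[0],
                      rms_norm_batch_invariant(x)[0])
```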

How to make kernels batch-invariant?

●For attention mechanisms (e.g. FlashAttention 2):
○The solution is to adopt a "fixed split-size" strategy.
○Instead of dividing the KV dimension in a way that depends on the total number of tokens, the dimension is broken into chunks of a predetermined, fixed size.
○For example, the reduction might always be performed over chunks of 128 elements. The number of chunks varies with the input size, but the reduction order within each chunk and the method of combining results across chunks remain consistent (see the sketch below).
○This ensures an identical computational graph and summation order, preserving batch invariance regardless of how many tokens are processed at once.
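A simplified single-query numpy sketch of the fixed split-size idea (the real FlashAttention-style kernel fuses this on the GPU; the function name is hypothetical and the chunk size 128 follows the slide's example). The KV length may vary, but the reduction always walks 128-wide chunks in the same order, using an online softmax to combine chunks.

```python
import numpy as np

CHUNK = 128  # fixed split size, independent of the total number of tokens

def attention_fixed_chunks(q, K, V):
    """q: (d,), K and V: (n_tokens, d). Online softmax over fixed-size KV chunks."""
    m = -np.inf                                   # running max of the logits
    denom = 0.0                                   # running softmax denominator
    acc = np.zeros(V.shape[1], dtype=np.float64)  # running weighted sum of values
    for start in range(0, K.shape[0], CHUNK):
        k_chunk, v_chunk = K[start:start + CHUNK], V[start:start + CHUNK]
        logits = k_chunk @ q
        new_m = max(m, logits.max())
        scale = np.exp(m - new_m)    # rescale previously accumulated partial results
        w = np.exp(logits - new_m)
        denom = denom * scale + w.sum()
        acc = acc * scale + w @ v_chunk
        m = new_m
    return acc / denom

rng = np.random.default_rng(0)
q, K, V = rng.standard_normal(64), rng.standard_normal((1000, 64)), rng.standard_normal((1000, 64))
print(attention_fixed_chunks(q, K, V).shape)  # (64,)
```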

Experiments

●For testing, Qwen3-235B-A22B-Instruct-2507 was used.
●1000 completions were sampled at temperature = 0 (see the measurement sketch below).
●Using the model as-is:
○Of the 1000 completions, only 80 were unique.
●Using batch-invariant kernels:
○All 1000 completions were identical.
●Performance (Qwen-3-8B, 1000 completions):

| Configuration                   | Time (seconds) |
|---------------------------------|----------------|
| vLLM default                    | 26             |
| Unoptimized Deterministic vLLM  | 55             |
| + Improved Attention Kernel     | 42             |
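A hedged sketch of the measurement loop behind these numbers: send the same prompt many times at temperature 0 and count the distinct completions. It assumes a vLLM server exposing the OpenAI-compatible API and the `openai` Python client; the endpoint URL and served model name are placeholders.

```python
from collections import Counter
from openai import OpenAI

# Placeholder endpoint/model: adjust to the vLLM server being tested.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
prompt = "Tell me about Richard Feynman"   # prompt used in the source blog post

completions = []
for _ in range(1000):
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-235B-A22B-Instruct-2507",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=1000,
    )
    completions.append(resp.choices[0].message.content)

counts = Counter(completions)
print(f"{len(counts)} unique completions out of {len(completions)}")
```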

Final thoughts

●It is possible to remove nondeterminism (randomness) from LLM inference by making the architecture's kernels batch-invariant.
●Deterministic systems allow for
○reliable and efficient debugging of ML systems;
○model validation by reproducing results when deploying to production - the gold standard, and a critical requirement for enterprise-grade MLOps systems and regulatory compliance;
○reproducibility of results in scientific research.
●However, this comes at a performance cost, although the cost is not large enough to make the method impractical.
●In what cases is the performance tradeoff worth the determinism? Is it always worth keeping models deterministic?

THANK YOU

WEBSITE: https://www.nablas.com/