This document investigates the causes of non-determinism in large language model (LLM) inference and introduces techniques for eliminating it and ensuring reproducible (deterministic) results.
Defeating non-determinism in LLM inference
Horace He, in collaboration with others at Thinking Machines
https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
Hypothesis: Floating point non-associativity
●When two floating point numbers with different exponents are added, the result must be rounded to the available precision, so information is lost.
●Floating point operations, especially on GPUs, are non-associative due to
○Finite precision
○Rounding errors
●Run-to-run nondeterminism usually arises because GPU kernels use concurrent "atomic add" operations for various neural network operations; the order in which these atomic adds complete is nondeterministic, so the rounded result can change between runs (see the sketch below).
●However, the forward pass of an LLM involves no such atomic adds, so we can assume the forward pass of LLM inference is run-to-run deterministic.
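A quick sketch in plain Python (ordinary IEEE-754 doubles; the values are made up for illustration) of both effects: grouping changes the rounded result of an addition, and summing the same values in a different order gives a slightly different total, which is what happens when concurrent atomic adds finish in a different order.

```python
# Floating point addition is not associative: grouping changes the rounded result.
a, b, c = 0.1, 1e20, -1e20
print((a + b) + c)   # 0.0 -- the 0.1 is lost when rounded into 1e20
print(a + (b + c))   # 0.1

# The same effect makes a reduction depend on accumulation order, which is why
# kernels built on concurrent "atomic add" are run-to-run nondeterministic.
import random
vals = [random.uniform(-1, 1) for _ in range(100_000)]
s1 = sum(vals)
random.shuffle(vals)           # a different (nondeterministic) accumulation order
s2 = sum(vals)
print(s1 == s2, abs(s1 - s2))  # usually False, with a tiny difference
```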
How to make kernels batch-invariant?
●RMSNorm
○Standard strategy (large batch): data-parallel, one core cluster per batch element (row).
○Standard strategy (small batch): split-reduction, multiple cores collaborate on a single batch element.
○Batch invariance: violated.
●Attention
○Standard strategy (large batch): split work along the reduction (KV) dimension based on input size.
○Standard strategy (small batch): split work along the reduction (KV) dimension based on input size (the strategy changes with size).
○Batch invariance: violated.
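The table above can be made concrete with a small sketch. The code below is not the actual GPU kernel; it simulates, in NumPy float32, the two RMSNorm-style reduction strategies (one sequential pass over a row versus combining per-chunk partial sums), with made-up row data and chunk count, to show that switching strategies changes the bits of the result and therefore breaks batch invariance.

```python
# Minimal sketch of why changing the reduction strategy with batch size breaks
# batch invariance for an RMSNorm-style sum of squares. Row data and the number
# of splits are hypothetical; only the rounding behaviour matters here.
import numpy as np

rng = np.random.default_rng(0)
row = rng.standard_normal(4096).astype(np.float32)

def data_parallel_sum_sq(x):
    # "Large batch" strategy: one core handles the whole row,
    # accumulating the squares left to right in a single sequential reduction.
    acc = np.float32(0.0)
    for v in x:
        acc += v * v
    return acc

def split_reduction_sum_sq(x, num_splits=8):
    # "Small batch" strategy: several cores each reduce a chunk of the row,
    # then the partial sums are combined. Different order => different rounding.
    partials = [data_parallel_sum_sq(chunk) for chunk in np.array_split(x, num_splits)]
    acc = np.float32(0.0)
    for p in partials:
        acc += p
    return acc

a = data_parallel_sum_sq(row)
b = split_reduction_sum_sq(row)
print(a, b, a == b)  # typically two slightly different float32 values
```

The implication of the table is that making a kernel batch-invariant means pinning one reduction strategy (and split sizes) regardless of batch size, trading some peak performance for bitwise-identical results per batch element.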