AI Research

Physicists Find LLM Agents May Obey Thermodynamic Laws, But the Implications Remain Murky

A Peking University team claims large language models exhibit "detailed balance" during generation, a property borrowed from statistical mechanics that governs how physical systems reach equilibrium.

Oliver Senti, Senior AI Editor
December 23, 2025 · 8 min read
[Image: Abstract network visualization showing bidirectional state transitions between nodes of varying potential, blending neural network and thermodynamic imagery]

Researchers from Peking University's School of Physics have published a paper arguing that LLM-based agents exhibit a property called "detailed balance" in their state transitions, a condition normally associated with equilibrium systems in statistical mechanics. The paper, posted to arXiv in December 2025, frames this as potentially the first discovery of a macroscopic physical law governing LLM dynamics that doesn't depend on model architecture or prompt engineering details.

The claim is ambitious. It suggests that rather than learning discrete rules or strategies, language models implicitly learn something resembling a potential function, a scalar value assigned to each possible state that governs transition probabilities. If true, this would be a surprising unification: models as different as GPT-5 Nano, Claude-4, and Gemini-2.5-flash would all be, in some coarse-grained sense, behaving like molecules in thermal equilibrium.

The physics analogy

Detailed balance is a concept from statistical mechanics that describes reversible Markov processes. A Markov chain is said to obey detailed balance if, at equilibrium, any sequence of states is equally probable when run forwards or backwards. In physical terms, it means that for every pair of states, the probability flux from A to B equals the flux from B to A, weighted by their equilibrium occupancies. Systems satisfying detailed balance are called "reversible" and lack net currents, meaning they've relaxed into genuine thermal equilibrium rather than being driven by external forces.
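To make the condition concrete, here is a minimal sketch with a toy three-state chain (illustrative numbers, not anything from the paper): detailed balance requires that the probability flux π(i)P(i→j) match π(j)P(j→i) for every pair of states.

```python
import numpy as np

# Toy 3-state chain: a symmetric transition matrix P whose stationary
# distribution pi is uniform. Detailed balance: pi[i]*P[i,j] == pi[j]*P[j,i].
P = np.array([
    [0.5, 0.3, 0.2],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
])
pi = np.array([1/3, 1/3, 1/3])

# Stationarity: pi P = pi
assert np.allclose(pi @ P, pi)

# Detailed balance: the flux matrix pi[i]*P[i,j] is symmetric,
# so there is no net probability current between any pair of states.
flux = pi[:, None] * P
assert np.allclose(flux, flux.T)
print("detailed balance holds for this toy chain")
```

A chain that fails this symmetry check carries a net current around some loop of states, the signature of a driven, non-equilibrium system.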

The Peking University team, led by Zhuo-Yang Song and Hua Xing Zhu, embeds LLMs within agent frameworks and treats their outputs as Markov transitions between states. They then measure transition probabilities empirically and check whether those probabilities satisfy the detailed balance condition. The mathematical statement is simple: if detailed balance holds, there exists a potential function V such that the log-ratio of forward to backward transition probabilities equals the difference in potential values between states.
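The potential-function form of the condition can be illustrated with a small sketch (sign and temperature conventions assumed here, not taken from the paper): a Metropolis-style kernel built from any potential V satisfies log P(a→b) − log P(b→a) = V(a) − V(b) for every pair of states.

```python
import numpy as np

# Hypothetical potential over 4 states (illustrative values).
V = np.array([0.0, 0.5, 1.2, 2.0])
n = len(V)

# Metropolis rule with a uniform proposal: always accept downhill moves,
# accept uphill moves with probability exp(-(V[b] - V[a])).
P = np.zeros((n, n))
for a in range(n):
    for b in range(n):
        if a != b:
            P[a, b] = min(1.0, np.exp(-(V[b] - V[a]))) / (n - 1)
    P[a, a] = 1.0 - P[a].sum()

# The log-ratio of forward to backward transitions equals the
# potential difference, which is the detailed-balance form the paper tests.
for a in range(n):
    for b in range(n):
        if a != b:
            assert np.isclose(np.log(P[a, b] / P[b, a]), V[a] - V[b])
print("log-ratio condition satisfied for all state pairs")
```

The researchers run this logic in reverse: they measure the transition probabilities empirically and ask whether a single V exists that makes all the log-ratios consistent.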

Their experimental setup involves tasks like conditioned word generation, where an LLM must produce words whose letter indices sum to 100 (like "ATTITUDE" or "BUZZY"), and symbolic fitting using an agent framework called IdeaSearchFitter. By sampling thousands of transitions and counting frequencies, they estimate transition kernels and test whether closed loops in the state graph sum to zero, as detailed balance requires.
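The word-sum constraint is easy to verify directly: with A=1 through Z=26, both example words land exactly on 100.

```python
def letter_sum(word: str) -> int:
    """Sum of 1-based alphabet indices (A=1, B=2, ..., Z=26)."""
    return sum(ord(c) - ord("A") + 1 for c in word.upper())

# Both example words from the paper satisfy the constraint:
# ATTITUDE = 1+20+20+9+20+21+4+5 = 100
# BUZZY    = 2+21+26+26+25       = 100
for word in ("ATTITUDE", "BUZZY"):
    print(word, letter_sum(word))  # prints 100 for each
```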

What the experiments actually show

The Claude-4 and Gemini-2.5-flash experiments are, frankly, not very convincing tests of detailed balance. Both models converged rapidly to just a handful of states. Claude-4 produced only 5 distinct valid words across 20,000 generations, while Gemini-2.5-flash managed 13. With so few states, the detailed balance condition becomes nearly tautological: if you only ever transition between two or three states, the constraint reduces to simple pairwise ratios that are almost guaranteed to look balanced.

The more interesting data comes from GPT-5 Nano, which explored more broadly, generating 645 different valid words. Here the researchers found 140 distinct triplets where transitions between all three state pairs were measured. The sum of log-ratios around these closed loops clustered near zero, consistent with detailed balance within sampling error. The error bars are substantial, but the trend is there.
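The triplet test is an instance of the Kolmogorov loop criterion: if detailed balance holds, the log-ratios of empirical transition probabilities must sum to zero around any closed loop. A minimal sketch with hypothetical counts (not the paper's data):

```python
import math
from collections import Counter

# Hypothetical transition counts among three states A, B, C,
# chosen so the loop criterion holds exactly.
counts = Counter({
    ("A", "B"): 120, ("B", "A"): 60,
    ("B", "C"): 90,  ("C", "B"): 45,
    ("C", "A"): 30,  ("A", "C"): 120,
})
total_from = Counter()
for (src, dst), c in counts.items():
    total_from[src] += c

def p(src, dst):
    """Empirical transition probability estimated from counts."""
    return counts[(src, dst)] / total_from[src]

# Kolmogorov criterion: log-ratios around A -> B -> C -> A sum to zero.
loop = [("A", "B"), ("B", "C"), ("C", "A")]
s = sum(math.log(p(a, b) / p(b, a)) for a, b in loop)
print(f"loop sum of log-ratios: {s:.3f}")  # prints 0.000
```

With real finite samples the sums scatter around zero rather than hitting it exactly, which is why the GPT-5 Nano result is a clustering near zero with substantial error bars rather than a clean identity.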

The IdeaSearchFitter experiments are more elaborate, involving longer reasoning chains and a database of 50,228 state transitions across 7,484 distinct states. The researchers fit a potential function by minimizing an action functional (essentially finding the V that best explains the observed transition asymmetries), then check whether the resulting V predicts the transition ratios. The scatter plot in Figure 4 shows reasonable agreement along the diagonal, though with significant spread, particularly at extreme potential differences where sampling becomes sparse.
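The fitting step can be sketched as a least-squares problem (a simplification of the paper's action functional, with hypothetical data): given observed log-ratios y on a set of edges, find the potential V minimizing the squared mismatch with V(a) − V(b), fixing one value to pin down the overall gauge.

```python
import numpy as np

# Hidden ground-truth potential over 4 states (illustrative values).
true_V = np.array([0.0, 0.7, 1.1, 2.3])
edges = [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3)]

# Simulated observed log-ratios y_ab = V(a) - V(b) plus sampling noise.
rng = np.random.default_rng(0)
y = np.array([true_V[a] - true_V[b] for a, b in edges])
y += rng.normal(scale=0.05, size=len(y))

# Design matrix: one row per edge, +1 at a and -1 at b.
A = np.zeros((len(edges), len(true_V)))
for i, (a, b) in enumerate(edges):
    A[i, a], A[i, b] = 1.0, -1.0

# Gauge fix V[0] = 0 by dropping column 0, then solve by least squares.
V_fit, *_ = np.linalg.lstsq(A[:, 1:], y, rcond=None)
V_fit = np.concatenate([[0.0], V_fit])
print(np.round(V_fit, 2))  # close to true_V up to the injected noise
```

The paper's Figure 4 scatter plot is essentially a diagnostic of this fit: predicted potential differences against observed log-ratios, with scatter growing where transitions were rarely sampled.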

The leap from data to theory

Here's where the paper's framing outstrips its evidence. The authors claim this demonstrates that LLMs have implicitly learned a global potential function that transcends architecture and prompt details. This is a strong claim for data from three models on two narrow tasks.

For a Markov chain with a stationary distribution, detailed balance is not merely sufficient for reversibility but equivalent to it; the real issue is that plenty of systems can look approximately balanced over limited observations without truly satisfying the condition globally. The researchers themselves note that at high potential differences (states visited rarely), the equilibrium condition "cannot be strictly satisfied." This is precisely the regime where deviations from equilibrium would actually show up.

More fundamentally, the paper conflates two very different phenomena. Physical systems satisfy detailed balance because of microscopic reversibility, the time-reversal symmetry of underlying dynamics. LLMs have no such symmetry. They're trained to predict next tokens, a fundamentally directional process. If their coarse-grained behavior looks equilibrium-like, this requires explanation, not mere observation.

What would it mean if true?

Suppose the detailed balance claim holds up under broader testing. What would we actually learn?

One possibility: LLMs trained on human-generated text have absorbed patterns that themselves obey some approximate equilibrium structure. Human reasoning might tend toward states that are, in some abstract sense, energetically favorable, and language models might be capturing this through their training objective. This would be consistent with the researchers' suggestion that models learn "potential functions" rather than explicit strategies.

But this framing raises as many questions as it answers. The potential function they estimate through their IdeaSearch method has 49 parameters and captures features like expression complexity and syntactic validity. It predicts about 70% of transitions correctly (transitions going downhill in potential exceed those going uphill by roughly 70-30). This is better than chance but far from the overwhelming asymmetry that physical equilibrium systems exhibit.
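The 70-30 asymmetry is just a downhill fraction, which a few lines make concrete (hypothetical potential and transition list, arranged to reproduce the 70% figure):

```python
# Hypothetical fitted potential over three states.
V = {"a": 0.0, "b": 0.5, "c": 1.3}

# Hypothetical observed transitions; 7 of 10 go downhill in V.
transitions = [("c", "a"), ("b", "a"), ("c", "b"), ("a", "b"),
               ("c", "a"), ("b", "c"), ("c", "b"), ("b", "a"),
               ("c", "a"), ("a", "c")]

downhill = sum(V[dst] < V[src] for src, dst in transitions)
print(f"{downhill / len(transitions):.0%} downhill")  # prints 70%
```

For comparison, a physical system at low temperature would show downhill fractions approaching 100% for large potential gaps, which is why 70-30 reads as weak structure rather than genuine equilibrium.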

The practical implications are similarly unclear. The authors suggest that measuring action values could help design better LLM generation strategies, or that deviation from equilibrium might indicate overfitting. These are speculative applications that would require substantially more validation before anyone should take them seriously.

The FunSearch connection

The paper situates itself within recent work on LLM-driven agents for scientific discovery, citing systems like FunSearch and AlphaEvolve. FunSearch paired an LLM with a programmatic evaluator to discover novel solutions to open problems in mathematics, while AlphaEvolve generalizes this to entire codebases and broader algorithmic optimization tasks.

These evolutionary coding agents share a common structure: generate candidates via LLM, evaluate them programmatically, select the best, and iterate. The Peking University researchers are essentially asking whether the LLM's generative preferences have any consistent structure that could be exploited beyond simple fitness-based selection.

If LLM dynamics truly exhibited detailed balance, one could potentially apply the entire toolkit of equilibrium statistical mechanics (free energy calculations, importance sampling, thermodynamic integration) to agent design. Recent results have shown that for a given invariant measure, reversible Markov chains have the slowest convergence, suggesting that breaking detailed balance might actually accelerate exploration. But this would require the equilibrium assumption to hold more tightly than the current evidence supports.

Why you should be skeptical

Several aspects of the paper warrant skepticism. First, the choice of K(x) = exp(-βx/2) as the convex function defining action is convenient but arbitrary. The authors show that detailed balance is a sufficient condition for minimizing action with this choice, but this is somewhat circular: they've defined action in a way that detailed balance satisfies.

Second, the states themselves are highly engineered. In the word-sum task, the state space is constrained to words whose letters sum to exactly 100. In the symbolic fitting task, states are expression trees with specific structure. These artificial constraints may induce equilibrium-like behavior that wouldn't appear in more naturalistic settings.

Third, the models tested are all operating at temperature 1.0 or with sampling parameters set to maximize exploration. Production deployments typically use lower temperatures, where the dynamics would look very different.

Finally, and most importantly, the paper provides no mechanistic explanation for why detailed balance should emerge. The authors gesture at the possibility that LLMs learn potential functions rather than strategies, but don't explain what training dynamics would produce this result or how it relates to the cross-entropy loss function that actually trains these models.

Carl Sagan's dictum applies here: extraordinary claims require extraordinary evidence. The claim that LLM dynamics obey macroscopic physical laws is extraordinary. The evidence, measuring approximate detailed balance in narrow tasks on three models, is interesting but not extraordinary.

Where this leads

The paper is best understood as a hypothesis-generating exercise rather than a definitive finding. The hypothesis, that coarse-grained LLM dynamics might be describable through equilibrium statistical mechanics, is worth testing more broadly. Future work would need to examine whether detailed balance holds across diverse task types, at different temperatures, with different model families, and crucially, whether deviations from detailed balance correlate with any practically relevant behavior.

The code and data are available on GitHub and Hugging Face, which is commendable. The experimental framework, embedding LLMs in agents and measuring transition statistics, is tractable and extensible. Whether anyone will find enough signal in the noise to build a genuine theory remains to be seen.

For now, file this under "intriguing if true," with emphasis on the conditional.

Tags: AI research, large language models, LLM agents, machine learning
Oliver Senti
Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.


