Accelerating MCMC Simulation

Neural network-driven importance sampling for accelerated Markov chain Monte Carlo

Accelerated Markov Chain Monte Carlo Simulation via Neural Network-Driven Importance Sampling
Michael M. Kim, Wei Cai
Department of Mechanical Engineering, Stanford University

Overview

Atomistic simulations are a powerful tool for understanding material behavior at the microscopic scale, but their applicability is fundamentally limited by the timescales accessible to brute-force methods. Systems remain trapped within metastable states for long stretches of simulation time, while the rare transitions between states are what actually drive the long-term evolution—a difficulty widely known as the time-scale or rare-event problem.

This work presents an importance-sampling framework for accelerating discrete-state Markov chain Monte Carlo (MCMC) simulations, in which a neural network is trained to represent the optimal bias potential—the logarithm of the optimal importance function. By operating in log-space, the method handles the extremely small magnitudes of the importance function near energy minima that would otherwise cause numerical underflow at low temperatures. A rigorous reweighting formula recovers the unbiased transition rates of the original system, and a branching random walk (BRW) procedure controls the variance of the estimator.

This is the discrete-state precursor to my continuous Langevin dynamics framework, which extends the same ideas to atomistic systems with continuous degrees of freedom.

Problem Setup

Consider a discrete domain $\Omega$ of microscopic states laid out on a regular grid, with an energy landscape $E(i)$ defined on each site. The kinetics is specified by jump rates between nearest-neighbor sites:

$$ r(i \to j) = \nu_0 \, \exp\!\left[-\frac{E(j) - E(i)}{2 k_B T}\right] \, \mathbb{1}_{\rm nn}(i,j), $$

where $\nu_0$ is a frequency prefactor and $\mathbb{1}_{\rm nn}(i,j)=1$ if $i$ and $j$ are nearest neighbors and $0$ otherwise. These rates satisfy detailed balance with respect to the Boltzmann distribution. The corresponding jump probabilities upon leaving site $i$ are $K(i\to j) = r(i\to j)/r_{\rm tot}(i)$, where $r_{\rm tot}(i)=\sum_j r(i\to j)$ is the total escape rate from site $i$.
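
For concreteness, here is a minimal Python sketch of this rate construction on a 2D grid (the function name and the default values of $\nu_0$ and $k_B T$ are illustrative, not taken from the paper):

```python
import numpy as np

def jump_probabilities(E, i, nu0=1.0, kBT=0.025):
    """Jump rates r(i -> j) from site i to its nearest neighbors, and the
    normalized jump probabilities K(i -> j) = r(i -> j) / r_tot(i).
    E is a 2D array of site energies (eV); i is a (row, col) index."""
    rows, cols = E.shape
    neighbors = [(i[0] + di, i[1] + dj)
                 for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                 if 0 <= i[0] + di < rows and 0 <= i[1] + dj < cols]
    rates = np.array([nu0 * np.exp(-(E[j] - E[i]) / (2.0 * kBT))
                      for j in neighbors])
    return neighbors, rates / rates.sum()
```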

[Figure: discrete energy landscape with two basins]
Figure 1. A discrete domain $\Omega$ of microscopic states on a 2D grid, colored by the energy landscape $E(i)$. Two metastable minima A and B sit at the bottoms of the basins, separated by saddle points $S_1$ and $S_2$. The white path shows a transition from A to B passing near $S_1$.

At low temperatures, a system initialized near A spends a long time in the A-basin before a rare transition into the B-basin. The two practical objectives are (i) obtaining accurate transition rates between metastable states, and (ii) resolving the relative probabilities of competing reaction channels (e.g. those that pass through $S_1$ vs. $S_2$).

Approach

Importance sampling modifies the jump probabilities through an importance function $I(i)$:

$$ K'(i\to j) = \frac{1}{n'(i)} \, K(i\to j) \, \frac{I(j)}{I(i)} \, \big[1 - \mathbb{1}_{\rm A}(j)\big], $$

where the indicator $\mathbb{1}_{\rm A}(j)$ removes returns to A, and $n'(i)$ is a normalization factor. There exists a unique optimal importance function $I^{\rm opt}(i)$ for which $n'(i) = 1$ for all $i$; under this choice, every successful path is enhanced by exactly the same factor $I^{\rm opt}({\rm B})/I^{\rm opt}({\rm A})$, so relative path probabilities are preserved and the original transition rate can be recovered exactly. The catch: finding $I^{\rm opt}$ exactly is as hard as solving the rare-event problem itself.
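
A sketch of the biased jump probabilities, evaluated in log-space with the bias potential $V = \log I$ so that ratios of the (possibly tiny) importance function never underflow. It reuses `jump_probabilities` from the sketch above; the `in_A` mask and all names are illustrative:

```python
import numpy as np

def biased_jump_probabilities(E, V, i, in_A, nu0=1.0, kBT=0.025):
    """Biased probabilities K'(i -> j) with returns to the A-basin removed.
    V is the bias potential on the grid (V = log I), so the importance
    ratio I(j)/I(i) is computed stably as exp(V(j) - V(i)).
    Returns the neighbors, K', and the normalization n'(i)."""
    neighbors, K = jump_probabilities(E, i, nu0, kBT)
    Kp = np.array([0.0 if in_A[j] else K[k] * np.exp(V[j] - V[i])
                   for k, j in enumerate(neighbors)])
    n_prime = Kp.sum()
    return neighbors, Kp / n_prime, n_prime
```

For the optimal bias, every $n'(i)$ equals one; for a learned, imperfect bias, the product of the $n'(i)$ along a path, together with the telescoped ratio of $I$ at the path's endpoints, gives the likelihood ratio that reweights each sampled path back to the original dynamics.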

The framework resolves three concrete obstacles to making this practical at high dimension and low temperature: (i) numerical underflow of the importance function at low temperature, avoided by working with the bias potential $\log I$ rather than $I$ itself; (ii) the difficulty of representing the bias potential over a high-dimensional state space, addressed by the neural-network parameterization trained with simulated annealing and adaptive sampling; and (iii) estimator variance when the learned bias is imperfect, controlled by the rigorous reweighting formula together with the branching random walk.

Results

2D System with Two Reaction Channels

The first benchmark is the 2D landscape shown in Figure 1, with energy minima at $(\pm 1.1, 0)$ nm and saddle points $S_1, S_2$ at $(0, \pm 1)$ nm, with barriers of $1.02$ and $0.98$ eV, respectively. Because the state space is finite, the optimal bias potential can be computed numerically and used as a ground truth. The neural network—two hidden layers of 30 tanh units, trained with simulated annealing from 5800 K down to 500 K—recovers the exact bias potential to high accuracy, with the largest residual errors confined to sparsely sampled states far from the dominant transition pathways.
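
A sketch of what this training setup can look like (assuming PyTorch; the stand-in energy surface, A-basin radius, uniform state sampling, batch size, and the loss built from the optimality condition $n'(i)=1$ are illustrative choices, not the paper's exact objective or adaptive sampling scheme):

```python
import torch
import torch.nn as nn

kB = 8.617e-5                                    # Boltzmann constant, eV/K

def energy(x):                                   # illustrative double-well stand-in for E(i)
    return 0.25 * (x[:, 0] ** 2 - 1.21) ** 2 + 0.5 * x[:, 1] ** 2

def in_A(x):                                     # A-basin indicator (illustrative radius)
    return ((x - torch.tensor([-1.1, 0.0])) ** 2).sum(-1) < 0.09

# Two hidden layers of 30 tanh units, as described in the text.
bias_net = nn.Sequential(nn.Linear(2, 30), nn.Tanh(),
                         nn.Linear(30, 30), nn.Tanh(),
                         nn.Linear(30, 1))
opt = torch.optim.Adam(bias_net.parameters(), lr=1e-3)
moves = 0.1 * torch.tensor([[1., 0.], [-1., 0.], [0., 1.], [0., -1.]])

for T in torch.linspace(5800.0, 500.0, 20):      # annealing schedule, 5800 K -> 500 K
    for epoch in range(100):                     # 100 epochs per annealing temperature
        x = 2.4 * torch.rand(256, 2) - 1.2       # uniform here; the paper uses adaptive sampling
        x = x[~in_A(x)]                          # the bias is only needed outside A
        n = len(x)
        nbrs = (x[:, None, :] + moves).reshape(-1, 2)
        dE = energy(nbrs).reshape(n, 4) - energy(x)[:, None]
        dV = bias_net(nbrs).reshape(n, 4) - bias_net(x)
        live = (~in_A(nbrs)).reshape(n, 4).float()
        # n'(i) = sum_j K(i->j) exp(V(j)-V(i)) [1 - 1_A(j)]; optimality is n'(i) = 1
        n_prime = (torch.softmax(-dE / (2 * kB * T), dim=1) * torch.exp(dV) * live).sum(1)
        loss = (torch.log(n_prime) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
```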

[Figure: learned bias potential, parity plot, and residual error on the 2D system]
Figure 2. Neural-network bias potential on the 2D system after 100 epochs per annealing temperature. Left: learned bias potential. Center: parity plot against the exact optimal bias potential. Right: spatial distribution of the residual error; deviations are small along the dominant transition pathways and concentrated in rarely visited states.

Variance Control via Branching Random Walk

To stress-test the framework, the network is trained on a coarse grid ($\Delta x = 0.1$ nm) and then used to bias dynamics on a refined grid ($\Delta x = 0.025$ nm), introducing a deliberate discretization mismatch. Without BRW, the path weights span many orders of magnitude—a few outlier paths dominate the estimator. With BRW (weight range $[0.5, 1.2]$), the weights are kept within a tight band, and the success-probability estimator becomes both accurate and low-variance: $\hat P_{\rm AB}=(1.52 \pm 0.06)\times 10^{-14}$ versus the exact $1.41\times 10^{-14}$.
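
A minimal sketch of the splitting/Russian-roulette step that keeps weights inside the band (the scheme shown is the standard one; the paper's variant may differ in details):

```python
import numpy as np

rng = np.random.default_rng(0)

def brw_control(walkers, w_min=0.5, w_max=1.2):
    """One branching step on a list of (state, weight) pairs: split heavy
    walkers, Russian-roulette light ones. All surviving weights end up in
    [w_min, w_max] and the expected total weight is unchanged."""
    out = []
    for state, w in walkers:
        if w > w_max:                       # split into m copies of weight w/m
            m = int(np.ceil(w / w_max))
            out.extend([(state, w / m)] * m)
        elif w < w_min:                     # survive with prob w/w_min at weight w_min
            if rng.random() < w / w_min:
                out.append((state, w_min))
        else:
            out.append((state, w))
    return out
```

Splitting a heavy walker into $m$ copies of weight $w/m$, and keeping a light walker with probability $w/w_{\min}$ at weight $w_{\min}$, both preserve the expected weight, so the estimator stays unbiased while its variance is tamed.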

[Figure: path-weight distributions without and with BRW]
Figure 3. Path-weight distribution without BRW (left) versus with BRW (right). BRW concentrates weights in a controlled range $[0.5, 1.2]$, eliminating both near-zero "unimportant" paths and runaway high-weight outliers.

Transition Rates and Mechanisms

The estimated transition rates converge as the grid is refined, and their temperature dependence agrees with Kramers' rate theory across more than ten orders of magnitude. Crucially, the framework also resolves the mechanism: the fraction of transitions through the upper saddle $S_1$ versus the lower saddle $S_2$ is recovered accurately, including the correction beyond the naive barrier-energy ratio.
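
For reference, extracting an effective barrier from an Arrhenius plot amounts to a linear fit of $\log k$ against $1/(k_B T)$. A toy example with synthetic rates (the $0.98$ eV barrier of $S_2$ is used only to generate the fake data):

```python
import numpy as np

kB = 8.617e-5                                   # Boltzmann constant, eV/K
T = np.array([300., 400., 500., 600., 800.])    # synthetic temperatures, K
rate = 1e13 * np.exp(-0.98 / (kB * T))          # fake rates with a 0.98 eV barrier

# Arrhenius fit: the slope of log(rate) vs 1/(kB T) is minus the effective barrier.
slope, intercept = np.polyfit(1.0 / (kB * T), np.log(rate), 1)
print(f"effective barrier = {-slope:.2f} eV, prefactor = {np.exp(intercept):.2e}")
```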

[Figure: Arrhenius plot; fraction of transitions through saddle $S_1$]
Figure 4. Left: Arrhenius plot of the estimated transition rate (stars) against Kramers' theory (dashed). Right: fraction of transitions passing through the upper saddle $S_1$ as a function of temperature; the importance-sampling estimates match Kramers' theory and deviate from a naive prediction based on barrier-energy differences alone.

Scaling to 14 Dimensions

The 14-dimensional benchmark embeds the 2D landscape into a higher-dimensional hypercube by adding harmonic terms in the remaining twelve coordinates. The exact optimal bias potential remains independent of the harmonic directions, which gives a clean ground truth for the 14D-trained network when projected back to the 2D subspace. With the hybrid Gaussian + MLP parameterization (two hidden layers of 100 ReLU units), the projected bias potential agrees with the exact 2D solution within the regions traversed by the dominant transition pathways.
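
One plausible reading of the hybrid parameterization is sketched below: a learnable Gaussian bump plus an MLP correction with two hidden layers of 100 ReLU units. Only the MLP architecture is stated in the text; the way the two parts are combined here is an assumption:

```python
import torch
import torch.nn as nn

class HybridBias(nn.Module):
    """Gaussian + MLP bias potential for the 14D benchmark (sketch).
    A learnable per-coordinate Gaussian bump carries the smooth global
    shape; the MLP (two hidden layers of 100 ReLU units, as in the text)
    learns the residual. The composition is an assumption, not taken
    from the paper."""
    def __init__(self, dim=14):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(dim))         # Gaussian center
        self.log_sigma = nn.Parameter(torch.zeros(dim))  # per-coordinate width
        self.amp = nn.Parameter(torch.ones(1))           # Gaussian amplitude
        self.mlp = nn.Sequential(nn.Linear(dim, 100), nn.ReLU(),
                                 nn.Linear(100, 100), nn.ReLU(),
                                 nn.Linear(100, 1))

    def forward(self, x):                                # x: (batch, dim)
        z = (x - self.mu) / torch.exp(self.log_sigma)
        gauss = self.amp * torch.exp(-0.5 * (z ** 2).sum(-1, keepdim=True))
        return gauss + self.mlp(x)                       # (batch, 1) bias values
```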

[Figure: 14D bias potential projected to 2D, parity plot, and residual error]
Figure 5. Hybrid Gaussian + MLP bias potential trained on the 14D system, projected onto the $(x_1, x_2)$ subspace. Left: learned bias potential. Center: parity against the exact 2D solution. Right: residual error, again small along the dominant transition pathways.

Takeaway. Training a neural network to represent the bias potential—rather than the importance function itself—keeps the framework numerically stable at low temperatures, while simulated annealing and adaptive sampling make it work in high dimensions. Combined with rigorous reweighting and branching random walk variance control, the method recovers both the right transition rates and the right relative channel probabilities, even when the learned bias potential is imperfect.

Connection and Future Directions

The continuous-state extension of this framework, applied to overdamped Langevin dynamics and the Lennard-Jones 7 atomistic benchmark, is described on the Rare Events page. Ongoing directions include applying the method to atomistic systems where rare events drive long-term behavior—crystal defect evolution, dislocation dynamics, and protein conformational change—and connecting it to broader machine-learning frameworks for stochastic dynamics.

Reference

M. M. Kim and W. Cai, Accelerated Markov Chain Monte Carlo Simulation via Neural Network-Driven Importance Sampling.