Department of Statistics, University of Warwick

Department of Statistics and Data Science, National University of Singapore
MCMC is used to compute expectations \(\pi(\varphi) = \mathbb{E}_{\pi}[\varphi(X)]\) when direct (i.i.d.) sampling from the target \(\pi\) is infeasible.
Method: simulate a Markov chain \((X_t)_{t \ge 0}\) with invariant distribution \(\pi\) and estimate \(\pi(\varphi)\) by the ergodic average \(\frac{1}{T}\sum_{t=1}^T \varphi(X_t)\).
Standard diagnostics:
The Gelman-Rubin diagnostic (Gelman and Rubin 1992) and ESS have a key limitation:
They are function-specific.
Wouldn’t it be better to diagnose the convergence of the distribution of the chain itself?
Theoretical convergence is measured by statistical distances between the chain’s distribution \(\mu_t\) and the target \(\pi\).
Biswas, Jacob, and Vanetti (2019) provide computable upper bounds on these distances using coupled chains.
A coupling runs two chains \((X_t, Y_t)\) jointly to make them meet:
Run the chains with a lag \(L\), so that the \(X\)-chain is \(L\) steps ahead and \(X_{t+L}\) is coupled with \(Y_t \sim \mu_t\). If \(\tau\) denotes the meeting time, then \(\max\!\left(0, \left\lceil \frac{\tau - L - t}{L} \right\rceil\right)\), computed from \(\tau\), is an estimator of an upper bound on \(\|\mu_t - \pi\|_{\text{TV}}\).
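For illustration, a minimal sketch of averaging this quantity over independent coupled runs, assuming the meeting times have already been recorded (all names hypothetical):

```python
import numpy as np

def tv_upper_bound(meeting_times, t, lag):
    """Monte Carlo estimate of the lag-L coupling upper bound on TV at iteration t.

    Each coupled run contributes max(0, ceil((tau - lag - t) / lag))."""
    taus = np.asarray(meeting_times, dtype=float)
    contributions = np.maximum(0.0, np.ceil((taus - lag - t) / lag))
    return contributions.mean()

# Hypothetical meeting times from 5 coupled runs with lag L = 1:
taus = [12, 30, 7, 19, 44]
print([tv_upper_bound(taus, t, lag=1) for t in (0, 10, 50)])
```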
For a target \(\pi\) and proposal \(q\), the independent Metropolis-Hastings (MH) algorithm iterates: propose \(X' \sim q(\cdot)\) independently of the current state, accept \(X_{t+1} = X'\) with probability \(\min\!\left(1, \frac{\pi(X')\, q(X_t)}{\pi(X_t)\, q(X')}\right)\), and otherwise set \(X_{t+1} = X_t\).
To couple two independent MH chains \((X_t, Y_t)\), we can use the same proposal \(X' \sim q(\cdot)\) (and the same acceptance uniform) for both chains at each step: whenever both chains accept the common proposal, they meet, and from then on they move identically.
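A minimal runnable sketch of this common-proposal coupling, assuming hypothetical `log_target`, `log_proposal`, and `sample_proposal` callables (the acceptance uniform is shared so that chains which have met stay together):

```python
import numpy as np

def coupled_independent_mh_step(x, y, rng, log_target, log_proposal, sample_proposal):
    """One step of two independent-MH chains sharing the same proposal and uniform."""
    xp = sample_proposal(rng)   # common proposal for both chains
    u = rng.uniform()           # common uniform for both accept/reject decisions

    def accept(current):
        log_ratio = (log_target(xp) - log_target(current)
                     + log_proposal(current) - log_proposal(xp))
        return np.log(u) < log_ratio

    x_new = xp if accept(x) else x
    y_new = xp if accept(y) else y
    return x_new, y_new

# Toy example: standard normal target, wide normal proposal (all hypothetical).
log_target = lambda x: -0.5 * x**2
log_proposal = lambda x: -0.5 * (x / 3.0)**2
sample_proposal = lambda rng: 3.0 * rng.standard_normal()

rng = np.random.default_rng(1)
x, y = 5.0, -5.0
for _ in range(100):
    x, y = coupled_independent_mh_step(x, y, rng, log_target, log_proposal, sample_proposal)
print(x == y)  # with high probability the two chains have met
```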
Practical Challenges:
We seek a diagnostic that is: computable online from the simulated chains, not tied to a particular test function \(\varphi\), and usable without choosing a lag or a warm-up period.
Imagine we knew the density ratio \(\psi_t(x) = \frac{\mathrm{d}\pi}{\mathrm{d}\mu_t}(x)\). Given samples \(X_t^n \sim \mu_t\), we could: form importance weights \(W_t^n \propto \psi_t(X_t^n)\) and estimate any \(f\)-divergence between \(\mu_t\) and \(\pi\) directly from the weighted sample.
Of course, \(\psi_t(x)\) is intractable. But what if we could approximate it?
A novel “weight harmonization” scheme for parallel MCMC chains that produces a consistent, computable approximation of the importance weights \(\psi_t(X_t^n)\).
A general method to use these weights to get online upper bounds on any \(f\)-divergence, including KL, \(\chi^2\), Hellinger, and TV distance.
The resulting diagnostic is guaranteed to improve over time and requires no lag or warm-up.
The \(\chi^2\)-divergence bound provides an estimate of the “effective number of active chains”, a highly practical and interpretable metric.
Let \(f:[0, \infty) \to \mathbb{R}\) be a convex function with \(f(1)=0\). \[\begin{equation} D_f(\pi \| \mu) = \int \mu(\mathrm{d} x) f\left( \frac{\mathrm{d}\pi}{\mathrm{d}\mu}(x) \right) \end{equation}\]
This family includes many famous divergences: the KL divergence (\(f(t) = t \log t\)), the \(\chi^2\)-divergence (\(f(t) = (t-1)^2\)), the squared Hellinger distance (\(f(t) = (\sqrt{t}-1)^2\)), and the total variation distance (\(f(t) = \tfrac{1}{2}|t-1|\)).
Consider two discrete measures supported on the same atoms \(X^{1:N}\): the uniform empirical measure \(\mu_N = \frac{1}{N}\sum_{n=1}^N \delta_{X^n}\) and a weighted measure \(\pi_N = \sum_{n=1}^N W^n \delta_{X^n}\), with \(\sum_n W^n = 1\).
The \(f\)-divergence between them is simple to compute: \[\begin{equation} D_f(\pi_N \| \mu_N) = \frac{1}{N} \sum_{n=1}^N f(N W^n) \end{equation}\]
For the \(\chi^2\)-divergence (\(f(t)=(t-1)^2\)), this becomes: \[\begin{equation} \chi^2(\pi_N \| \mu_N) = N \sum_{n=1}^N (W^n)^2 - 1 \end{equation}\] This is directly related to the Effective Sample Size (ESS) for importance sampling: \[\begin{equation} \text{ESS}(W^{1:N}) = \frac{1}{\sum (W^n)^2} = \frac{N}{\chi^2+1} \end{equation}\]
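As a concrete illustration (NumPy, names illustrative), the plug-in \(f\)-divergence and the ESS computed from a set of normalized weights:

```python
import numpy as np

def f_divergence(weights, f):
    """Plug-in divergence D_f(pi_N || mu_N) = (1/N) sum_n f(N * W^n)."""
    w = np.asarray(weights, dtype=float)
    return np.mean(f(w.size * w))

def chi2_and_ess(weights):
    """chi^2 divergence N * sum (W^n)^2 - 1 and ESS = 1 / sum (W^n)^2."""
    w = np.asarray(weights, dtype=float)
    chi2 = w.size * np.sum(w**2) - 1.0
    ess = 1.0 / np.sum(w**2)          # equals N / (chi2 + 1)
    return chi2, ess

w = np.array([0.4, 0.3, 0.2, 0.1])
print(f_divergence(w, lambda t: (t - 1.0)**2))  # chi^2 via the generic formula: 0.2
print(chi2_and_ess(w))                          # (0.2, 3.33...)
```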
Key Insight (Theorem 1):
If we have a “good” set of weights \(W_t^{1:N}\) for our MCMC samples \(X_t^{1:N} \sim \mu_t\) (i.e., \(\sum W_t^n \delta_{X_t^n}\) is a consistent estimator of \(\pi\)), then:
\[\begin{equation} \mathbb{P}\left( \frac{1}{N}\sum_{n=1}^N f(N W_t^n) \le D_f(\pi \| \mu_t) - \varepsilon \right) \to 0 \quad \text{as } N \to \infty, \ \text{for every } \varepsilon > 0 \end{equation}\]
In words: asymptotically, the plug-in quantity does not fall below the true divergence, so it is a valid upper bound on \(D_f(\pi \| \mu_t)\).
So, if we can generate good weights \(W_t^n\) at each MCMC step \(t\), we can track the convergence of the chain’s distribution.
How to get weights?
One natural choice: self-normalized importance weights against the initial distribution, \(W_0^n \propto \pi(X_0^n) / \mu_0(X_0^n)\), assuming \(\mu_0\) can be evaluated. Problem: this gives a consistent estimator, but the resulting diagnostic \(\frac{1}{N}\sum_n f(N W_0^n)\) is static; it ignores all the mixing done by the MCMC kernel.
We need weights that evolve.
A coupling \(\bar{K}(x, y, \cdot, \cdot)\) runs two chains \((X_{t+1}, Y_{t+1})\) jointly from \((X_t, Y_t)\), encouraging them to meet.
Key Idea: If chains meet (\(X_{t+1} = Y_{t+1}\)), they become information-theoretically identical.
Therefore, they should share their historical information, i.e., their weights.
\[\begin{align} W_{t+1}^n &= \alpha W_t^n + (1-\alpha) W_t^m \\ W_{t+1}^m &= (1-\alpha) W_t^n + \alpha W_t^m \end{align}\] We use \(\alpha=1/2\), as this choice optimally reduces the variance of the weights.
At step \(t\), we have two chains, \(X_t^n\) and \(X_t^m\), with weights \(W_t^n\) and \(W_t^m\).
We run one step of the coupled kernel: \((X_{t+1}^n, X_{t+1}^m) \sim \bar{K}(X_t^n, X_t^m, \cdot, \cdot)\).
We need to allow all chains to interact. We run \(2N\) chains, viewed as \(N\) pairs.
At each step \(t \to t+1\):
Couple: For each of the \(N\) current pairs \((X_t^n, X_t^{m})\), sample \((X_{t+1}^n, X_{t+1}^{m}) \sim \bar{K}(X_t^n, X_t^{m}, \cdot, \cdot)\).
Harmonize: If \(X_{t+1}^n = X_{t+1}^{m}\), set \(W_{t+1}^n = W_{t+1}^{m} = (W_t^n + W_t^{m})/2\). Otherwise, weights are unchanged.
Reshuffle: Randomly permute the pairings among the chains that just met. This allows information (weights) to propagate throughout the entire system of \(2N\) chains over time.
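A schematic sketch of one sweep of this scheme, assuming a coupled kernel (e.g., the coupled independent MH step sketched earlier) and initial weights are given; this illustrates the bookkeeping rather than reproducing the authors' reference implementation:

```python
import numpy as np

def harmonization_sweep(states, weights, pairs, coupled_kernel, rng):
    """One step t -> t+1 of the weight-harmonization scheme for 2N chains.

    states  : array of shape (2N, d), current chain states
    weights : array of shape (2N,), current normalized weights (sum to 1)
    pairs   : integer array of shape (N, 2), indices of the current pairings
    coupled_kernel(x, y, rng) -> (x_new, y_new), one step of a coupled MCMC kernel
    """
    met_slots = []  # positions in `pairs` whose two chains just met
    for k, (n, m) in enumerate(pairs):
        # 1. Couple: advance the pair with one step of the coupled kernel.
        states[n], states[m] = coupled_kernel(states[n], states[m], rng)
        # 2. Harmonize: a pair that meets shares the averaged weight.
        if np.array_equal(states[n], states[m]):
            weights[n] = weights[m] = 0.5 * (weights[n] + weights[m])
            met_slots.append(k)
    # 3. Reshuffle: randomly re-pair the chains that just met, so that weight
    #    information propagates through the whole system of 2N chains over time.
    if len(met_slots) > 1:
        met_chains = pairs[met_slots].ravel()
        rng.shuffle(met_chains)
        pairs[met_slots] = met_chains.reshape(-1, 2)
    return states, weights, pairs
```

After each sweep, the diagnostic \(\frac{1}{2N}\sum_n f(2N W_t^n)\) can be evaluated directly from `weights`, exactly as in the earlier snippet.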
Proposition (Invariance of Expectations) The un-normalized estimator \(\hat{I}_{t,N}(\varphi) = \sum_{n=1}^{2N} w_t^n \varphi(X_t^n)\), built from the un-normalized weights \(w_t^n\), has a constant expectation over time: \[\begin{equation} \mathbb{E}[\hat{I}_{t,N}(\varphi)] = \mathbb{E}[\hat{I}_{0,N}(\varphi)] = 2N \int \varphi(x) \gamma(x)\mathrm{d} x \end{equation}\] where \(\gamma\) denotes the un-normalized target density.
Theorem (Consistency) For any time \(t\), as the number of particles \(N \to \infty\), our weighted sample provides a consistent estimate of the target expectation: \[\begin{equation} \sum_{n=1}^{2N} W_t^n \varphi(X_t^n) \xrightarrow{\mathbb{P}} \int \varphi(x) \pi(x) \mathrm{d} x \end{equation}\]
Proposition (Non-increasing bounds) The \(f\)-divergence upper bound is guaranteed to be non-increasing in time, almost surely. \[\begin{equation} \frac{1}{2N}\sum_{n=1}^{2N} f(2N W_{t+1}^n) \le \frac{1}{2N}\sum_{n=1}^{2N} f(2N W_t^n) \end{equation}\]
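A one-line sketch of why this holds: harmonization replaces the two weights of a pair that has met by their average, and by convexity of \(f\) (Jensen's inequality), \[\begin{equation} 2\, f\!\left( 2N \cdot \frac{W_t^n + W_t^m}{2} \right) \;\le\; f\!\left(2N W_t^n\right) + f\!\left(2N W_t^m\right), \end{equation}\] so each harmonized pair can only decrease its contribution to the sum, while all other terms are unchanged. Taking \(f(t) = (t-1)^2\) gives the ESS statement below.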
Thus the ESS is non-decreasing. \[\begin{equation} \text{ESS}_{t+1} \ge \text{ESS}_t \end{equation}\]
Assumption (Minimum probability of coupling) For \((X', Y') \sim \bar{K}(x, y, \cdot, \cdot)\), \[\begin{equation} \mathbb{P}(X' = Y' \mid x,y) \ge p_c > 0 \quad \text{for all } x,y \end{equation}\]
Theorem (Geometric Weight Convergence) Under the uniform coupling assumption, the weights converge exponentially fast to the uniform distribution \(\bar{W} = (1/2N, \dots, 1/2N)\): \[\begin{equation} \mathbb{E}[\|W_t - \bar{W}\|_2^2] = O(\rho^t) \quad \text{for some } 0 <\rho < 1 \end{equation}\]
This implies that the empirical divergence bound \(\frac{1}{2N}\sum_{n=1}^{2N} f(2N W_t^n)\) converges to \(0\) almost surely as \(t \to \infty\), confirming that the method provides a valid convergence diagnostic that vanishes at stationarity.
A Bayesian logistic regression (\(d=49\)) compared with the diagnostic of Biswas, Jacob, and Vanetti (2019).
A high-dimensional (\(d=2500\)) model.
We introduced a new method for MCMC convergence diagnostics based on weight harmonization.
The main limitation is the conservativeness of the bound, especially for large \(N\). This is due to the pairwise interaction scheme.
Potential Improvements:
Rao-Blackwellization: Instead of picking one random pairing, can we average over all possible pairings? This would dramatically increase interaction but is computationally challenging.
Offline Correction: The current method is a “filtering” approach. Can we use ideas from particle smoothing (a “backward pass”) to refine the weights after the run is complete?
Variance Reduction: Incorporate control variates to accelerate the convergence of the diagnostic itself.
Thank you!