Wiley Series in Probability and Statistics

Reuven Y. Rubinstein, Dirk P. Kroese

Chapter 6 Markov Chain Monte Carlo - all with Video Answers

Educators

Chapter Questions

Problem 1

Verify that the local balance equation (6.3) holds for the Metropolis-Hastings algorithm.

Check back soon!

When running an MCMC algorithm, it is important to know when the transient (or burn-in) period has finished; otherwise, steady-state statistical analyses such as those in Section 4.3.2 may not be applicable. In practice this is often done via a visual inspection of the sample path. As an example, run the random walk sampler with normal target distribution $\mathrm{N}(10,1)$ and proposal $Y \sim \mathrm{~N}(x, 0.01)$. Take a sample size of $N=5000$. Determine roughly when the process reaches stationarity.

Check back soon!

Problem 3

A useful tool for examining the behavior of a stationary process $\left\{X_t\right\}$ obtained, for example, from an MCMC simulation, is the covariance function $R(t)=\operatorname{Cov}\left(X_t, X_0\right)$; see Example 6.4. Estimate the covariance function for the process in Problem 6.2 and plot the results. In Matlab's signal processing toolbox this is implemented under the M-function xcov.m. Try different proposal distributions of the form $N\left(x, \sigma^2\right)$ and observe how the covariance function changes.

Check back soon!

Problem 4

Implement the independence sampler with an $\operatorname{Exp}(1)$ target and an $\operatorname{Exp}(\lambda)$ proposal distribution for several values of $\lambda$. Similar to the importance sampling situation, things go awry when the sampling distribution gets too far from the target distribution, in this case when $\lambda>2$. For each run, use a sample size of $10^5$ and start with $x=1$.
a) For each value $\lambda=0.2,1,2$, and 5 , plot a histogram of the data and compare it with the true pdf.
b) For each value of the above values of $\lambda$, calculate the sample mean and repeat this for 20 independent runs. Make a dotplot of the data (plot them on a line) and notice the differences. Observe that for $\lambda=5$ most of the sample means are below 1 , and thus underestimate the true expectation 1 , but a few are significantly greater. Observe also the behavior of the corresponding auto-covariance functions, both between the different $\lambda$ s and, for $\lambda=5$, within the 20 runs.

Check back soon!

Problem 5

Implement the random walk sampler with an $\operatorname{Exp}(1)$ target distribution, where $Z$ (in the proposal $Y=x+Z$ ) has a double exponential distribution with parameter $\lambda$. Carry out a study similar to that in Problem 6.4 for different values of $\lambda$, say $\lambda=0.1,1,5,20$. Observe that (in this case) the random walk sampler has a more stable behavior than the independence sampler.

Check back soon!

01:57

Problem 6

Let $\mathbf{X}=(X, Y)^T$ be a random column vector with a bivariate normal distribution with expectation vector $\mathbf{0}=(0,0)^T$ and covariance matrix

$$
\Sigma=\left(\begin{array}{ll}
1 & \varrho \\
\varrho & 1
\end{array}\right)
$$

a) Show that $(Y \mid X=x) \sim \mathrm{N}\left(\varrho x, 1-\varrho^2\right)$ and $(X \mid Y=y) \sim \mathrm{N}\left(\varrho y, 1-\varrho^2\right)$.
b) Write a systematic Gibbs sampler to draw $10^4$ samples from the bivariate distribution $\mathrm{N}(0, \Sigma)$ and plot the data for $\varrho=0,0.7$ and 0.9 .

Hunza Gilgit

Numerade Educator

Problem 7

A remarkable feature of the Gibbs sampler is that the conditional distributions in Algorithm 6.4.1 contain sufficient information to generate a sample from the joint one. The following result (by Hammersley and Clifford [9]) shows that it is possible to directly express the joint pdf in terms of the conditional ones. Namely,

$$
f(x, y)=\frac{f_{Y \mid X}(y \mid x)}{\int \frac{f_{Y \mid X}(y \mid x)}{f_{X \mid Y}(x \mid y)} d y}
$$

Prove this. Generalize this to the $n$-dimensional case.

Check back soon!

01:38

Problem 8

In the Ising model the expected magnetization per spin is given by

$$
M(T)=\frac{1}{n^2} \mathbb{E}_{\pi_T}\left[\sum_i S_i\right],
$$

where $\pi_T$ is the Boltzmann distribution at temperature $T$. Estimate $M(T)$, for example via the Swendsen-Wang algorithm, for various values of $T \in[0,5]$, and observe that the graph of $M(T)$ changes sharply around the critical temperature $T \approx 2.61$. Take $n=20$ and use periodic boundaries.

Penny Riley

Numerade Educator

Problem 9

Run Peter Young's Java applet in
http://bartok.ucsc.edu/peter/java/ising/keep/ising.html
to gain a better understanding of how the Ising model works.

Check back soon!

Problem 10

As in Example 6.6, let $\mathscr{X}^*=\left\{\mathbf{x}: \sum_{i=1}^n x_i=m, x_i \in\{0, \ldots, m\}, i=1, \ldots, n\right\}$. Show that this set has $\binom{m+n-1}{n-1}$ elements.

Check back soon!

Problem 11

In a simple model for a closed queueing network with $n$ queues and $m$ customers, it is assumed that the service times are independent and exponentially distributed, say with rate $\mu_i$ for queue $i, i=1, \ldots, n$. After completing service at queue $i$, the customer moves to queue $j$ with probability $p_{i j}$. The $\left\{p_{i j}\right\}$ are the so-called routing probabilities.

Figure 6.16 A closed queueing network.

It can be shown (see, for example, [12]) that the stationary distribution of the number of customers in the queues is of product form (6.10), with $f_i$ being the pdf of the $\mathrm{G}\left(1-y_i / \mu_i\right)$ distribution; thus, $f_i\left(x_i\right) \propto\left(y_i / \mu_i\right)^{x_i}$. Here the $\left\{y_i\right\}$ are constants that are obtained from the following set of flow balance equations:

$$
y_i=\sum_j y_j p_{j i}, \quad i=1, \ldots, n,
$$

which has a one-dimensional solution space. Without loss of generality, $y_1$ can be set to 1 to obtain a unique solution.
Consider now the specific case of the network depicted in Figure 6.16, with $n=3$ queues. Suppose the service rates are $\mu_1=2, \mu_2=1$, and $\mu_3=1$. The routing probabilities are given in the figure.
a) Show that a solution to (6.25) is $\left(y_1, y_2, y_3\right)=(1,10 / 21,4 / 7)$.
b) For $m=50$ determine the exact normalization constant $C$.
c) Implement the procedure of Example 6.6 to estimate $C$ via MCMC and compare the estimate for $m=50$ with the exact value.

Check back soon!

Problem 12

Let $X_1, \ldots, X_n$ be a random sample from the $\mathrm{N}\left(\mu, \sigma^2\right)$ distribution. Consider the following Bayesian model:
- $f\left(\mu, \sigma^2\right)=1 / \sigma^2$ :
- ( $\mathbf{x}_i \mid \mu, \sigma$ ) ~ $\mathrm{N}\left(\mu, \sigma^2\right), i=1, \ldots, n$ independently.

Note that the prior for ( $\mu, \sigma^2$ ) is improper. That is, it is not a pdf in itself, but by obstinately applying Bayes' formula, it does yield a proper posterior pdf. In some sense it conveys the least amount of information about $\mu$ and $\sigma^2$. Let $\mathbf{x}=\left(x_1, \ldots, x_n\right)$ represent the data. The posterior pdf is given by

$$
f\left(\mu, \sigma^2 \mid \mathbf{x}\right)=\left(2 \pi \sigma^2\right)^{-n / 2} \exp \left\{-\frac{1}{2} \frac{\sum_i\left(x_i-\mu\right)^2}{\sigma^2}\right\} \frac{1}{\sigma^2} .
$$

We wish to sample from this distribution via the Gibbs sampler.
a) Show that $\left(\mu \mid \sigma^2, \mathbf{x}\right) \sim N\left(\bar{x}, \sigma^2 / n\right)$, where $\bar{x}$ is the sample mean.
b) Prove that

$$
f\left(\sigma^2 \mid \mu, \mathbf{x}\right) \propto \frac{1}{\left(\sigma^2\right)^{n / 2+1}} \exp \left(-\frac{n}{2} \frac{V_\mu}{\sigma^2}\right),
$$

where $V_\mu=\sum_i\left(x_i-\mu\right)^2 / n$ is the classical sample variance for known $\mu$. In other words, $\left(1 / \sigma^2 \mid \mu, \mathbf{x}\right) \sim \operatorname{Gamma}\left(n / 2, n V_\mu / 2\right)$.
c) Implement a Gibbs sampler to sample from the posterior distribution, taking $n=100$. Run the sampler for $10^5$ iterations. Plot the histograms of $f(\mu \mid \mathbf{x})$ and $f\left(\sigma^2 \mid \mathbf{x}\right)$ and find the sample means of these posteriors. Compare them with the classical estimates.
d) Show that the true posterior pdf of $\mu$ given the data is given by

$$
f(\mu \mid \mathbf{x}) \propto\left((\mu-\bar{x})^2+V\right)^{-n / 2},
$$

where $V=\sum_i\left(x_i-\bar{x}\right)^2 / n$. (Hint: in order to evaluate the integral

$$
f(\mu \mid \mathbf{x})=\int_0^{\infty} f\left(\mu, \sigma^2 \mid \mathbf{x}\right) d \sigma^2
$$

write it first as $(2 \pi)^{-n / 2} \int_0^{\infty} t^{n / 2-1} \exp \left(-\frac{1}{2} t c\right) d t$, where $c=n V_\mu$, by applying the change of variable $t=1 / \sigma^2$. Show that the latter integral is proportional to $c^{-n / 2}$. Finally, apply the decomposition $V_\mu=(\bar{x}-\mu)^2+V$.)

Check back soon!

Problem 13

Suppose $f(\boldsymbol{\theta} \mid \mathbf{x})$ is the posterior pdf for some Bayesian estimation problem. For example, $\theta$ could represent the parameters of a regression model based on the data $\mathbf{x}$. An important use for the posterior pdf is to make predictions about the distribution of other random variables. For example, suppose the pdf of some random variable $\mathbf{Y}$ depends on $\theta$ via the conditional pdf $f(\mathbf{y} \mid \boldsymbol{\theta})$. The predictive pdf of $Y$ given $\mathbf{x}$ is defined as

$$
f(\mathbf{y} \mid \mathbf{x})=\int f(\mathbf{y} \mid \boldsymbol{\theta}) f(\boldsymbol{\theta} \mid \mathbf{x}) d \boldsymbol{\theta}
$$

which can be viewed as the expectation of $f(\mathbf{y} \mid \boldsymbol{\theta})$ under the posterior pdf. Therefore, we can use Monte Carlo simulation to approximate $f(\mathbf{y} \mid \mathbf{x})$ as

$$
f(\mathbf{y} \mid \mathbf{x}) \approx \frac{1}{N} \sum_{i=1}^N f\left(\mathbf{y} \mid \theta_i\right)
$$

where the sample $\left\{\boldsymbol{\theta}_i, i=1, \ldots, N\right\}$ is obtained from $f(\boldsymbol{\theta} \mid \mathbf{x})$; for example, via MCMC.
As a concrete application, suppose that the independent measurement data: $-0.4326,-1.6656,0.1253,0.2877,-1.1465$ come from some $\mathrm{N}\left(\mu, \sigma^2\right)$ distribution. Define $\theta=\left(\mu, \sigma^2\right)$. Let $Y \sim \mathrm{~N}\left(\mu, \sigma^2\right)$ be a new measurement. Estimate and draw the predictive pdf $f(y \mid \mathbf{x})$ from a sample $\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_N$ obtained via the Gibbs sampler of Problem 6.12. Take $N=10,000$. Compare this with the "common-sense" Gaussian pdf with expectation $\bar{x}$ (sample mean) and variance $s^2$ (sample variance).

Check back soon!

01:54

Problem 14

In the zero-inflated Poisson (ZIP) model, random data $X_1, \ldots, X_n$ are assumed to be of the form $X_i=R_i Y_i$, where the $\left\{Y_i\right\}$ have a $\operatorname{Poi}(\lambda)$ distribution and the $\left\{R_i\right\}$ have a $\operatorname{Ber}(p)$ distribution, all independent of each other. Given an outcome $\mathbf{x}=\left(x_1, \ldots, x_n\right)$, the objective is to estimate both $\lambda$ and $p$. Consider the following hierarchical Bayes model:
- $p \sim \mathrm{U}(0,1) \quad$ (prior for $p$ ),
- $(\lambda \mid p) \sim \operatorname{Gamma}(a, b) \quad$ (prior for $\lambda$ ),
- ( $r_i \mid p, \lambda$ ) $\sim \operatorname{Ber}(p)$ independently (from the model above),
- ( $x_i \mid \mathbf{r}, \lambda, p$ ) $\sim \operatorname{Poi}\left(\lambda r_i\right)$ independently (from the model above),
where $\mathbf{r}=\left(r_1, \ldots, r_n\right)$ and $a$ and $b$ are known parameters. It follows that

$$
f(\mathbf{x}, \mathbf{r}, \lambda, p)=\frac{b^a \lambda^{a-1} \mathrm{e}^{-b \lambda}}{\Gamma(a)} \prod_{i=1}^n \frac{\mathrm{e}^{-\lambda r_i}\left(\lambda r_i\right)^{x_i}}{x_{i}!} p^{r_i}(1-p)^{1-r_i}
$$

We wish to sample from the posterior pdf $f(\lambda, p, \mathbf{r} \mid \mathbf{x})$ using the Gibbs sampler.
a) Show that
1. $(\lambda \mid p, \mathbf{r}, \mathbf{x}) \sim \operatorname{Gamma}\left(a+\sum_i x_i, b+\sum_i r_i\right)$.
2. $(p \mid \lambda, \mathbf{r}, \mathbf{x}) \sim \operatorname{Beta}\left(1+\sum_i r_i, n+1-\sum_i r_i\right)$.
3. $\left(r_i \mid \lambda, p, \mathbf{x}\right) \sim \operatorname{Ber}\left(\frac{p \mathrm{e}^{-\lambda}}{p \mathrm{e}^{-\lambda}+(1-p) I_{\left\{x_i=0\right\}}}\right)$.
b) Generate a random sample of size $n=100$ for the ZIP model using parameters $p=0.3$ and $\lambda=2$.
c) Implement the Gibbs sampler, generate a large (dependent) sample from the posterior distribution and use this to construct $95 \%$ Bayesian CIs for $p$ and $\lambda$ using the data in b). Compare these with the true values.

Dominador Tan

Numerade Educator

Problem 15

' Show that $\mu$ in (6.15) satisfies the local balance equations

$$
\mu(\mathbf{x}, \mathbf{y}) \mathbf{R}\left[(\mathbf{x}, \mathbf{y}),\left(\mathbf{x}^{\prime}, \mathbf{y}^{\prime}\right)\right]=\mu\left(\mathbf{x}^{\prime}, \mathbf{y}^{\prime}\right) \mathbf{R}\left[\left(\mathbf{x}^{\prime}, \mathbf{y}^{\prime}\right),(\mathbf{x}, \mathbf{y})\right]
$$

Thus $\mu$ is stationary with respect to $\mathbf{R}$, that is, $\mu \mathbf{R}=\mu$. Show that $\mu$ is also stationary with respect to $\mathbf{Q}$. Show, finally, that $\mu$ is stationary with respect to $\mathbf{P}=\mathbf{Q R}$.

Check back soon!

Problem 16

* This is to show that the systematic Gibbs sampler is a special case of the generalized Markov sampler. Take $\mathscr{B}$ to be the set of indices $\{1, \ldots, n\}$, and define for the $Q$-step

$$
\mathbf{Q}_{\mathbf{x}}\left(y, y^{\prime}\right)= \begin{cases}1 & \text { if } y^{\prime}=y+1 \text { or } y^{\prime}=1, y=n \\ 0 & \text { otherwise }\end{cases}
$$

Let the set of possible transitions $\mathscr{R}(\mathbf{x}, y)$ be the set of vectors $\left\{\left(\mathbf{x}^{\prime}, y\right)\right\}$ such that all coordinates of $\mathbf{x}^{\prime}$ are the same as those of $\mathbf{x}$ except for possibly the $y$-th coordinate.
a) Show that the stationary distribution of $\mathbf{Q}_{\mathbf{x}}$ is $q_{\mathbf{x}}(y)=1 / n$, for $y=1, \ldots, n$.
b) Show that

$$
\mathbf{R}\left[(\mathbf{x}, y),\left(\mathbf{x}^{\prime}, y\right)\right]=\frac{f\left(\mathbf{x}^{\prime}\right)}{\sum_{(\mathbf{z}, y) \in \mathscr{R}(\mathbf{x}, y)} f(\mathbf{z})}, \quad \text { for }\left(\mathbf{x}^{\prime}, y\right) \in \mathscr{R}(\mathbf{x}, y)
$$

c) Compare with Algorithm 6.4.1.

Check back soon!

Problem 17

Prove that the Metropolis-Hastings algorithm is a special case of the generalized Markov sampler. (Hint: let the auxiliary set $\mathscr{G}$ be a copy of the target set $\mathscr{X}$, let $\mathbf{Q}_{\mathbf{x}}$ correspond to the transition function of the Metropolis-Hastings algorithm (that is, $\left.\mathbf{Q}_{\mathbf{x}}(\cdot, \mathbf{y})=q(\mathbf{x}, \mathbf{y})\right)$, and define $\mathscr{R}(\mathbf{x}, \mathbf{y})=\{(\mathbf{x}, \mathbf{y}),(\mathbf{y}, \mathbf{x})\}$. Use arguments similar to those for the Markov jump sampler (see (6.20)) to complete the proof.)

Check back soon!

Problem 18

Barker's and Hastings' MCMC algorithms differ from the symmetric Metropolis sampler only in that they define the acceptance ratio $\alpha(\mathbf{x}, \mathbf{y})$ to be respectively $f(\mathbf{y}) /(f(\mathbf{x})+ f(\mathbf{y}))$ and $s(\mathbf{x}, \mathbf{y}) /(1+1 / \varrho(\mathbf{x}, \mathbf{y}))$ instead of $\min \{f(\mathbf{y}) / f(\mathbf{x}), 1\}$. Here $\varrho(\mathbf{x}, \mathbf{y})$ is defined in (6.6) and $s$ is any symmetric function such that $0 \leqslant \alpha(\mathbf{x}, \mathbf{y}) \leqslant 1$. Show that both are special cases of the generalized Markov sampler. (Hint: take $\mathscr{Y}=\mathscr{X}$.)

Check back soon!

04:42

Problem 19

Implement the simulated annealing algorithm for the $n$-queens problem suggested in Example 6.13. How many solutions can you find?

Morgan Cheatham

Numerade Educator

Problem 20

Implement the Metropolis-Hastings based simulated annealing algorithm for the TSP in Example 6.12. Run the algorithm on some test problems in
http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95/

Check back soon!

Problem 21

Write a simulated annealing algorithm based on the random walk sampler to maximize the function

$$
S(x)=\left|\frac{\sin ^8(10 x)+\cos ^5(5 x+1)}{x^2-x+1}\right|, \quad x \in \mathbb{R}
$$

Use a $\mathrm{N}\left(x, \sigma^2\right)$ proposal function, given the current state $x$. Start with $x=0$. Plot the current best function value against the number of evaluations of $S$ for various values of $\sigma$ and various annealing schedules. Repeat the experiments several times to assess what works best.

Check back soon!