Convergence of Markov Chain

Total Variation Distance

Define Total Variation Distance:

dTV(μ(x),v(x))=12xΩμ(x)v(x)d_{TV}(\mu(x), v(x)) = \frac{1}{2} \sum_{x \in \Omega} |\mu(x) - v(x)|

This is equivalent to the following:

dTV(μ(x),v(x))=xΩ(v(x)μ(x))=xΩ+(μ(x)v(x))d_{TV}(\mu(x), v(x)) = \sum_{x \in \Omega^{-}} (v(x) - \mu(x)) = \sum_{x \in \Omega^{+}} (\mu(x) - v(x))

where Ω+={xΩ:μ(x)v(x)}\Omega^{+} = \{x \in \Omega: \mu(x) \geq v(x)\}, Ω={xΩ:μ(x)<v(x)}\Omega^{-} = \{x \in \Omega: \mu(x) < v(x)\},
and

dTV(μ(x),v(x))=maxSΩμ(S)v(S)d_{TV}(\mu(x), v(x)) = \max_{S \subset \Omega} |\mu(S) - v(S)|

where μ(S)=xSμ(x)\mu(S) = \sum_{x \in S} \mu(x), v(S)=xSv(x)v(S) = \sum_{x \in S} v(x).

Proof:
Define Ω+={xΩ:μ(x)v(x)}\Omega^{+} = \{x \in \Omega: \mu(x) \geq v(x)\}, Ω={xΩ:μ(x)<v(x)}\Omega^{-} = \{x \in \Omega: \mu(x) < v(x)\}, then

dTV(μ(x),v(x))=12xΩμ(x)v(x)=12xΩ+(μ(x)v(x))+12xΩ(v(x)μ(x))\begin{aligned} d_{TV}(\mu(x), v(x)) &= \frac{1}{2} \sum_{x \in \Omega} |\mu(x) - v(x)| \\ &= \frac{1}{2} \sum_{x \in \Omega^{+}} (\mu(x) - v(x)) + \frac{1}{2} \sum_{x \in \Omega^{-}} (v(x) - \mu(x)) \end{aligned}

Since xΩμ(x)=1=xΩ+μ(x)+xΩμ(x)\sum_{x \in \Omega} \mu(x) = 1 = \sum_{x \in \Omega^{+}} \mu(x) + \sum_{x \in \Omega^{-}} \mu(x), we have

dTV(μ(x),v(x))=12xΩ+(μ(x)v(x))+12xΩ(v(x)μ(x))=12xΩ(v(x)μ(x))+12xΩ(v(x)μ(x))=xΩ(v(x)μ(x))=xΩ+(μ(x)v(x))\begin{aligned} d_{TV}(\mu(x), v(x)) &= \frac{1}{2} \sum_{x \in \Omega^{+}} (\mu(x) - v(x)) + \frac{1}{2} \sum_{x \in \Omega^{-}} (v(x) - \mu(x)) \\ &= \frac{1}{2} \sum_{x \in \Omega^{-}} (v(x) - \mu(x)) + \frac{1}{2} \sum_{x \in \Omega^{-}} (v(x) - \mu(x)) \\ &= \sum_{x \in \Omega^{-}} (v(x) - \mu(x)) \\ &= \sum_{x \in \Omega^{+}} (\mu(x) - v(x)) \end{aligned}

If S=Ω+S = \Omega^{+} or Ω\Omega^{-}, then

dTV(μ(x),v(x))=12xΩμ(x)v(x)=maxSΩxSμ(x)v(x)\begin{aligned} d_{TV}(\mu(x), v(x)) &= \frac{1}{2} \sum_{x \in \Omega} |\mu(x) - v(x)| \\ &= \max_{S \in \Omega} \sum_{x \in S} |\mu(x) - v(x)| \\ \end{aligned}

If SS contains any elements xΩx \in \Omega^{-}, then dTV(μ(x),v(x))=xS(μ(x)v(x))xΩ+(μ(x)v(x))d_{TV}(\mu(x), v(x)) = |\sum_{x \in S} (\mu(x) - v(x))| \leq \sum_{x \in \Omega^{+}} (\mu(x) - v(x)) when S=Ω+S = \Omega^{+}.

Convergence of Markov Chain

Assume we start from state xx, run Markov chain for tt steps, then we get the distribution PxtP_{x}^{t}. If we want to prove PxtP_{x}^{t} converges to stationary distribution π\pi, we need to prove dTV(Pxt,π)0d_{TV}(P_{x}^{t}, \pi) \rightarrow 0 as tt \rightarrow \infty. Thus, we need to bound dTV(Pxt,π)d_{TV}(P_{x}^{t}, \pi).

Define dx(t):=dTV(Pxt,π)d_{x}(t) := d_{TV}(P_{x}^{t}, \pi).

Consider all possible initial states xΩx \in \Omega, we define
d(t):=maxxΩdTV(Pxt,π)d(t) := \max_{x \in \Omega} d_{TV} (P_{x}^{t}, \pi).

Assume there are two Markov chains xtPxtx_{t} \sim P_{x}^{t}, ytPyty_{t} \sim P_{y}^{t}. xtx_{t} and yty_{t} share the same transition probability matrix PP but start from different states. Then we define

dˉ(t):=maxx,yΩdTV(Pxt,Pyt).\bar{d}(t) := \max_{x, y \in \Omega} d_{TV}(P_{x}^{t}, P_{y}^{t}).

Next, we will prove this Lemma:

Lemma 1.

d(t)dˉ(t)2d(t).d(t) \leq \bar{d}(t) \leq 2 d(t).

Proof: Let’s prove the second inequality dˉ(t)2d(t)\bar{d}(t) \leq 2 d(t).

dˉ(t)=maxx,yΩdTV(Pxt,Pyt)=maxx,yΩ12zΩPxt(z)Pyt(z)=maxx,yΩ12zΩPxt(z)π(z)+π(z)Pyt(z)maxx,yΩ[12zΩPxt(z)π(z)+12zΩπ(z)Pyt(z)]=maxxΩdTV(Pxt,π)+maxyΩdTV(Pyt,π)=d(t)+d(t)=2d(t)\begin{aligned} \bar{d}(t) &= \max_{x, y \in \Omega} d_{TV}(P_{x}^{t}, P_{y}^{t}) \\ &= \max_{x, y \in \Omega} \frac{1}{2} \sum_{z \in \Omega} |P_{x}^{t}(z) - P_{y}^{t}(z)| \\ & = \max_{x, y \in \Omega} \frac{1}{2} \sum_{z \in \Omega} |P_{x}^{t}(z) - \pi(z) + \pi(z) - P_{y}^{t}(z)| \\ &\leq \max_{x, y \in \Omega} [\frac{1}{2} \sum_{z \in \Omega} |P_{x}^{t}(z) - \pi(z)| + \frac{1}{2} \sum_{z \in \Omega} |\pi(z) - P_{y}^{t}(z)| ]\\ &= \max_{x \in \Omega} d_{TV}(P_{x}^{t}, \pi) + \max_{y \in \Omega} d_{TV}(P_{y}^{t}, \pi) \\ &= d(t) + d(t) \\ &= 2 d(t) \end{aligned}

For the first inequality dˉ(t)2d(t)\bar{d}(t) \leq 2 d(t), we need to prove d(t)dˉ(t)d(t) \leq \bar{d}(t).

Define

Sx,y=argmaxSΩPxt(S)Pyt(S)S_{x,y}^{*} = \arg\max_{S \subset \Omega} (P_{x}^{t}(S) - P_{y}^{t}(S))

Sx=argmaxSΩ,yΩPxt(S)Pyt(S)S_{x}^{**} = \arg\max_{S \subset \Omega, y \in \Omega} (P_{x}^{t}(S) - P_{y}^{t}(S))

Sx=argmaxSΩyΩπ(y)Pxt(S)Pyt(S)S_{x}^{*} = \arg\max_{S \subset \Omega} \sum_{y \in \Omega} \pi(y)(P_{x}^{t}(S) - P_{y}^{t}(S))

d(t)=maxxΩdTV(Pxt,π)=maxxΩmaxSΩPxt(S)π(S)=maxxΩmaxSΩ+(Pxt(S)π(S))=maxxΩmaxSΩ+(Pxt(S)yΩπ(y)Pyt(S))(π(y)=xΩπ(x)Px,y)=maxxΩmaxSΩ+yΩπ(y)(Pxt(S)Pyt(S)).\begin{aligned} d(t) &= \max_{x \in \Omega} d_{TV}(P_{x}^{t}, \pi) \\ &= \max_{x \in \Omega} \max_{S \subset \Omega}|P_{x}^{t}(S) - \pi(S)| \\ &= \max_{x \in \Omega} \max_{S \subset \Omega^{+}}(P_{x}^{t}(S) - \pi(S)) \\ &= \max_{x \in \Omega} \max_{S \subset \Omega^{+}}(P_{x}^{t}(S) - \sum_{y \in \Omega} \pi(y) P_{y}^{t}(S)) \qquad (\pi(y) = \sum_{x \in \Omega} \pi(x) P_{x,y}) \\ &= \max_{x \in \Omega} \max_{S \subset \Omega^{+}} \sum_{y \in \Omega} \pi(y) (P_{x}^{t}(S) - P_{y}^{t}(S)). \end{aligned}

Since for all SS:
Pxt(Sx,y)Pyt(Sx,y)Pxt(S)Pyt(S)(P_{x}^{t}(S_{x,y}^{*}) - P_{y}^{t}(S_{x,y}^{*}))\geq (P_{x}^{t}(S) - P_{y}^{t}(S))

we have,

d(t)=maxxΩmaxSΩ+yΩπ(y)(Pxt(S)Pyt(S))maxxΩyΩπ(y)(Pxt(Sx,y)Pyt(Sx,y))=maxxΩyΩπ(y)maxS(Pxt(S)Pyt(S))maxxΩyΩπ(y)(Pxt(Sx)Pyt(Sx))=maxxΩyΩπ(y)maxy,S(Pxt(S)Pyt(S))=maxx,yΩmaxSΩ(Pxt(S)Pyt(S))(yΩπ(y)=1)=maxx,yΩdTV(Pxt,Pyt)=dˉ(t).\begin{aligned} d(t) &= \max_{x \in \Omega} \max_{S \subset \Omega^{+}} \sum_{y \in \Omega} \pi(y) (P_{x}^{t}(S) - P_{y}^{t}(S)) \\ & \leq \max_{x \in \Omega} \sum_{y \in \Omega} \pi(y) (P_{x}^{t}(S_{x,y}^{*}) - P_{y}^{t}(S_{x,y}^{*})) \\ &= \max_{x \in \Omega} \sum_{y \in \Omega} \pi(y) \max_{S} (P_{x}^{t}(S) - P_{y}^{t}(S)) \\ & \leq \max_{x \in \Omega} \sum_{y \in \Omega} \pi(y) (P_{x}^{t}(S_{x}^{**}) - P_{y}^{t}(S_{x}^{**})) \\ & = \max_{x \in \Omega} \sum_{y \in \Omega} \pi(y) \max_{y, S}(P_{x}^{t}(S) - P_{y}^{t}(S)) \\ &= \max_{x, y \in \Omega}\max_{ S \subset \Omega}(P_{x}^{t}(S) - P_{y}^{t}(S)) \qquad (\sum_{y \in \Omega} \pi(y) = 1) \\ &= \max_{x, y \in \Omega} d_{TV}(P_{x}^{t}, P_{y}^{t}) \\ &= \bar{d}(t). \end{aligned}

that is, d(t)dˉ(t)2d(t)d(t) \leq \bar{d}(t) \leq 2d(t).

What we have proved is that, d(t)d(t) is bounded by dˉ(t)\bar{d}(t), and dˉ(t)\bar{d}(t) is controlled by maxx,yΩdTV(Pxt,Pyt)\max_{x,y \in \Omega} d_{TV}(P_{x}^{t}, P_{y}^{t}). Let’s find a way to bound dTV(Pxt,Pyt)d_{TV}(P_{x}^{t}, P_{y}^{t})!

Define 1. (Coupling) Assume x,yΩx , y \in \Omega, xμ,yvx \sim \mu, y \sim v, μ,v\mu, v are two distributions. A joint distribution w(x,y)w(x, y) on Ω×Ω\Omega \times \Omega is called couping if xΩ\forall x \in \Omega, yw(x,y)=μ(x)\sum_{y}w(x, y) = \mu(x), yΩ\forall y \in \Omega, xw(x,y)=v(y)\sum_{x} w(x, y) = v(y).

Lemma 2. Consider μ,v\mu, v defined on Ω\Omega,

a. For any coupling w(x,y)w(x, y) of μ,v\mu, v, dTV(μ,v)P(xy)d_{TV}(\mu, v) \leq P(x \ne y).

b. There always exists a coupling w(x,y)w(x, y) of μ,v\mu, v such that dTV(μ,v)=P(xy)d_{TV}(\mu, v) = P(x \ne y).

Proof: a. z\forall z, w(z,z)yΩw(z,y)=μ(z)w(z, z) \leq \sum_{y \in \Omega} w(z, y) = \mu(z). Similarly, w(z,z)v(z)w(z, z) \leq v(z). Thus, w(z,z)min(μ(z),v(z))w(z, z) \leq \min(\mu(z), v(z)).

P(xy)=1P(x=y)=1zΩw(z,z)1zΩmin(μ(z),v(z))=zΩμ(z)zΩmin(μ(z),v(z))=zΩ+(μ(z)v(z))=dTV(μ,v)\begin{aligned} P(x \ne y) &= 1 - P(x = y) \\ &= 1 - \sum_{z \in \Omega} w(z, z) \\ &\geq 1 - \sum_{z \in \Omega} \min(\mu(z), v(z)) \\ &= \sum_{z \in \Omega} \mu(z) - \sum_{z \in \Omega} \min(\mu(z), v(z)) \\ &= \sum_{z \in \Omega^{+}} (\mu(z) - v(z)) \\ &= d_{TV}(\mu, v) \end{aligned}

b. We can construct a coupling w(x,y)w(x, y):

w(x,y)={min{μ(x),v(y)},x=y(μ(x)w(x,x))(v(y)w(y,y))1zΩw(z,z),xyw(x, y) = \left\{ \begin{aligned} & \min \{\mu(x), v(y)\}, \quad x = y \\ &\frac{(\mu(x) - w(x, x))(v(y) - w(y,y))}{1 - \sum_{z \in \Omega}w(z,z)}, \quad x \ne y \end{aligned} \right.

For this joint distribution, we have

P(xy)=1P(x=y)=1zΩw(z,z)=1zΩmin(μ(z),v(z))=zΩμ(z)zΩmin(μ(z),v(z))=zΩ+(μ(z)v(z))=dTV(μ,v)\begin{aligned} P(x \ne y) &= 1 - P(x = y) \\ &= 1 - \sum_{z \in \Omega} w(z, z) \\ &= 1 - \sum_{z \in \Omega} \min(\mu(z), v(z)) \\ &= \sum_{z \in \Omega} \mu(z) - \sum_{z \in \Omega} \min(\mu(z), v(z)) \\ &= \sum_{z \in \Omega^{+}} (\mu(z) - v(z)) \\ &= d_{TV}(\mu, v) \end{aligned}

Thus, this joint distribution w(x,y)w(x, y) satisfies dTV(μ,v)=P(xy)d_{TV}(\mu, v) = P(x \ne y). Now, we need to prove that this joint distribution w(x,y)w(x, y) is a coupling of μ,v\mu, v.

xΩ\forall x \in \Omega,

yΩw(x,y)=y=xmin{μ(x),v(y)}+yΩ,yx(μ(x)w(x,x))(v(y)w(y,y))1zΩw(z,z)=w(x,x)+(μ(x)w(x,x))1zΩw(z,z)yx(v(y)w(y,y))=w(x,x)+(μ(x)w(x,x))1zΩw(z,z)(1v(x)(zw(z,z)w(x,x)))(yxv(y)=1v(x))=w(x,x)+(μ(x)w(x,x))1zΩw(z,z)(1zw(z,z)+w(x,x)v(x))(yxw(y,y)=zw(z,z)w(x,x))\begin{aligned} \sum_{y \in \Omega} w(x, y) &= \sum_{y=x} \min \{\mu(x), v(y)\} + \sum_{y \in \Omega, y \ne x} \frac{(\mu(x) - w(x, x))(v(y) - w(y,y))}{1 - \sum_{z \in \Omega}w(z,z)} \\ &= w(x, x) + \frac{(\mu(x) - w(x, x))}{1 - \sum_{z \in \Omega}w(z,z)} \sum_{y \ne x} (v(y) - w(y,y)) \\ &= w(x,x) + \frac{(\mu(x) - w(x, x))}{1 - \sum_{z \in \Omega}w(z,z)} (1 - v(x) - (\sum_{z} w(z,z) - w(x,x))) \qquad (\sum_{y \ne x} v(y) = 1 - v(x))\\ &= w(x,x) + \frac{(\mu(x) - w(x, x))}{1 - \sum_{z \in \Omega}w(z,z)} (1 - \sum_{z} w(z,z) + w(x,x) - v(x) ) \qquad (\sum_{y \ne x} w(y,y) = \sum_{z} w(z,z) - w(x, x)) \end{aligned}

If $ x \in \Omega^{+} = {x | \mu(x) \geq v(x)}$, then w(x,x)=v(x)w(x,x) = v(x), yΩw(x,y)=v(x)+(μ(x)v(x))=μ(x)\sum_{y \in \Omega} w(x, y) = v(x) + (\mu(x) - v(x)) = \mu(x).

If $ x \in \Omega^{-} = {x | \mu(x) < v(x)}$, then w(x,x)=μ(x)w(x,x) = \mu(x), yΩw(x,y)=μ(x)+0=μ(x)\sum_{y \in \Omega} w(x, y) = \mu(x) + 0 = \mu(x).

Thus, w(x,y)w(x, y) is a coupling of μ,v\mu, v.

Last but not least, let’s begin to prove the nonincreasing property of d(t)d(t)!! Almost close to the end!! \o^o/

Lemma 3. Consider two Markov chains xtPxtx^{t} \sim P_{x}^{t}, ytPyty^{t} \sim P_{y}^{t}, xtx^{t} and yty^{t} share the same transition probability matrix PP but start from different states. If xt=ytx^t = y^t, then xt+1=yt+1x^{t+1} = y^{t+1}, elif xtytx^t \ne y^t, then xt+1yt+1x^{t+1} \ne y^{t+1}, xt+1x^{t+1} and yt+1y^{t+1} are independent.

Define 2. (Coupling of Markov chains) Consider two Markov chains xtPxtx^{t} \sim P_{x}^{t}, ytPyty^{t} \sim P_{y}^{t}, xtx^{t} and yty^{t} share the same transition probability matrix PP but start from different states. If

P(yt+1xt,yt)=P(yt+1yt)P(y^{t+1} | x^{t}, y^{t}) = P(y^{t+1} | y^{t})

and

P(xt+1xt,yt)=P(xt+1xt)P(x^{t+1} | x^{t}, y^{t}) = P(x^{t+1} | x^{t})

then we say xtx^{t} and yty^{t} are coupled.

Define wt:=wt(xt,yt)w^{t} := w^{t}(x^t, y^t) is a coupling of Pxt,PytP_{x}^{t}, P_{y}^{t}, xtPxt,yPytx^t \sim P_x^t, y \sim P_{y}^{t}, and wtw^{t} satisfies Lemma 2(b).

Pwt(xt+1yt+1)=Pwt(xt+1yt+1xtyt)p(xtyt)+Pwt(xt+1yt+1xt=yt)p(xt=yt)P_{w^{t}}(x^{t+1} \ne y^{t+1}) = P_{w^{t}}(x^{t+1} \ne y^{t+1} | x^t \ne y^t) p(x^t \ne y^t) + P_{w^{t}}(x^{t+1} \ne y^{t+1} | x^t = y^t) p(x^t = y^t)

If xt=ytx^t = y^t, according to Lemma 3, Pwt(xt+1yt+1)=Pwt(xt+1=yt+1xt=yt)p(xt=yt)Pwt(xtyt)P_{w^{t}}(x^{t+1} \ne y^{t+1}) = P_{w^{t}}(x^{t+1} = y^{t+1} | x^t = y^t) p(x^t = y^t) \leq P_{w^{t}}(x^{t} \ne y^t);

If xtytx^t \ne y^t, Pwt(xt+1yt+1)=Pwt(xt+1yt+1xtyt)p(xtyt)Pwt(xtyt)P_{w^{t}}(x^{t+1} \ne y^{t+1}) = P_{w^{t}}(x^{t+1} \ne y^{t+1} | x^t \ne y^t) p(x^t \ne y^t) \leq P_{w^{t}}(x^{t} \ne y^t).

dx(t+1)=dTV(Pxt+1,Pyt+1Pwt(xt+1yt+1)Pwt(xtyt)=dTV(Pxt,Pyt)=dx(t)d_x(t+1) = d_{TV}(P_{x}^{t+1}, P_y^{t+1}) \leq P_{w^{t}}(x^{t+1} \ne y^{t+1}) \leq P_{w^{t}}(x^{t} \ne y^t) = d_{TV}(P_x^t, P_y^{t}) = d_x(t)

dx(t):=dTV(Pxt,π)d_{x}(t) := d_{TV}(P_{x}^{t}, \pi).

d(t)=maxxΩdx(t)d(t) = \max_{x \in \Omega} d_{x}(t).

d(t)dˉ(t)2d(t)d(t) \leq \bar{d}(t) \leq 2d(t).

dˉ(t):=maxx,yΩdTV(Pxt,Pyt)\bar{d}(t) := \max_{x, y \in \Omega} d_{TV}(P_{x}^{t}, P_{y}^{t})

Q: if we consider dTV(Pxt,π)d_{TV}(P_x^t, \pi), then the coupling wt:=wt(xt,y)w^{t} := w^{t}(x^t, y) should be a coupling of Pxt,πP_{x}^{t}, \pi, xtPxt,yπx^t \sim P_x^t, y \sim \pi.

π\pi is different from PytP_y^t at first.

dx(t)=dTV(Pxt,π)d(t)dˉ(t)=maxx,yΩdTV(Pxt,Pyt)=Pwt(xtyt)d_x(t) = d_{TV}(P_{x}^{t}, \pi) \leq d(t) \leq \bar{d}(t) = \max_{x, y \in \Omega} d_{TV}(P_{x}^{t}, P_{y}^{t}) = P_{w^t}(x^t \ne y^t)

Now, we have already proved that d(t)d(t) is nonincreasing. Next, we will prove that d(t)d(t) converges to 0.

Useful Lectures

  • Copyright: Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.
  • Copyrights © 2021-2026 Xue Yu
  • Visitors: | Views: