Monte Carlo Gradient Estimation in Machine Learning

Monte Carlo Methods and Stochastic Optimisation

The mean-value analysis problem:

F(θ):=p(x;θ)f(x;ϕ)dx=Ep(x;θ)f(x;ϕ),(1)\mathcal{F}(\theta) := \int p(x; \theta) f(x; \phi) \text{d}x = \mathbb{E}_{p(x; \theta)} f(x; \phi), \enspace (1)

where f(x;ϕ)f(x; \phi) is the cost with parameters ϕ\phi, p(x;θ)p(x;\theta) is the measure, a probability distribution that is continuous in its domain and differentiable with θ\theta.

This objective is very common in variational inference, reinforcement learning, etc.

To learn the distribution parameter θ\theta, we need to consider the gradient:

η:=θF(θ)=θEp(x;θ)f(x;ϕ).(2)\eta := \nabla_{\theta} \mathcal{F}(\theta) = \nabla_{\theta} \mathbb{E}_{p(x; \theta)} f(x; \phi). \enspace (2)

η\eta is called the sensitivity analysis of F\mathcal{F}.

Problems:

  • not able to evaluate the expectation in closed form;
  • xx: high-dimensional, θ\theta: high-dimensional;
  • the cost funtion may not be differential or a black-box function.

Monte Carlo Estimators

We can numerically evaluate the integral by first drawing independent samples x^(1),...,x^(N)\hat{x}^{(1)}, ..., \hat{x}^{(N)} from the distribution p(x;θ)p(x; \theta), and then computing the averaged of the function evaluated at these samples:

FˉN=1Nn=1Nf(x^(n)),x^(n)p(x;θ),n=1,...,N.\bar{\mathcal{F}}_N = \frac{1}{N} \sum_{n=1}^{N} f(\hat{x}^{(n)}), \qquad \hat{x}^{(n)} \sim p(x; \theta), \quad n = 1,..., N.

FˉN\bar{\mathcal{F}}_N is a random variable and is called Monte Carlo estimator of eq. (1).

Remark 1. As long as we can write an integral in the form of eq. (1) (a product of a function and a distribution that we can easily sample from), we can apply the Monte Carlo method.

Four Properties

  • Consistency. As NN \to \infty, the estimator FˉNEp(x;θ)f(x;ϕ)\bar{\mathcal{F}}_N \to \mathbb{E}_{p(x; \theta)} f(x; \phi). This can be easily satisfied according to the strong law of large number.
  • Unbiasedness.

Ep(x;θ)[FˉN]=Ep(x;θ)[1Nn=1Nf(x^(n))]=1Nn=1NEp(x;θ)[f(x^(n))]=Ep(x;θ)[f(x)].\mathbb{E}_{p(x; \theta)} [\bar{\mathcal{F}}_N] = \mathbb{E}_{p(x; \theta)} [\frac{1}{N} \sum_{n=1}^{N} f(\hat{x}^{(n)})] = \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{p(x; \theta)} [f(\hat{x}^{(n)})] = \mathbb{E}_{p(x; \theta)}[f(x)].

(if we repeat the estimation process many times, the estimate is centered on the the actual value of the integral on average)

  • Minimum variance. If we consider two unbiased estimators using the same number of sampling NN, we will prefer the estimator that has lower variance. [if MC estimator has the minimum variance?]
  • Computational efficiency. for example: a computational cost linear in the number of parameters, can be computed in parallel.

The central role of Gradient Estimation eq. (2)

some examples in different areas.

Variational Inference

Variational inference is a general method for approximating complex and unknown distributions by the closest distribution within a tractable family. Consider a generic probabilistic model p(xz)p(z)p(x|z)p(z) that defines a generative process in which observed data xx is generated from a set of unobserved variables z using a data distribution p(xz)p(x|z) and a prior
distribution p(z)p(z). The posterior distribution of this generative process p(zx)p(z|x) is unknown, and is approximated by a variational distribution q(zx;θ)q(z|x; \theta) with variational parameters θ\theta. The objective is

maxθ,ϕEq(zx;θ)[logp(xz;ϕ)logq(zx;θ)p(z)].\max_{\theta, \phi} \mathbb{E}_{q(z|x;\theta)} [\log p(x|z;\phi) - \log \frac{q(z|x;\theta)}{p(z)}].

Optimising the distribution qq requires the gradient of the objective with respect to the variational parameters θ\theta:

η=θEq(zx;θ)[logp(xz;ϕ)logq(zx;θ)p(z)].\eta = \nabla_{\theta}\mathbb{E}_{q(z|x;\theta)} [\log p(x|z;\phi) - \log \frac{q(z|x;\theta)}{p(z)}].

Reinforcement Learning

Model-free policy search is an area of reinforcement learning where we learn a policy|a distribution over actions|that on average maximises the accumulation of long-term rewards. Through interaction in an environment, we can generate trajectories τ=(s1,a1,s2,a2,...,sT,aT)\tau = (s_1, a_1, s_2, a_2, ... , s_T , a_T ) that consist of pairs of states st and actions at for time period
t=1,...,Tt = 1, ... , T. A policy is learnt by following the policy gradient:

η=θEp(τ;θ)[t=0Tγtr(st,at)]\eta = \nabla_{\theta} \mathbb{E}_{p(\tau;\theta)} [\sum_{t=0}^{T} \gamma^t r(s_t,a_t)]

The cost is the return over the trajectory, which is a weighted sum of rewards obtained at each time step r(st,at)r(s_t, a_t), with the discount facthor γ[0,1]\gamma \in [0, 1]. The measure is the joint distribution over states and actions p(τ;θ)=t=0T1[p(st+1st,at)p(atst;θ)]p(aTsT;θ)p(\tau;\theta) = \prod_{t=0}^{T-1} [p(s_{t+1}|s_t, a_t)p(a_t|s_t; \theta)]p(a_T |s_T ; \theta).

Intuitive Analysis of Gradient Estimators

The gradients θEp(x;θ)[f(x)]\nabla_{\theta} \mathbb{E}_{p(x;\theta)}[f(x)] can be computed in two ways:

  • Derivatives of Measure. The gradient can be computed by differentiation of the measure p(x;θ)p(x; \theta). Gradient estimators in this class include the score function estimator and the measure-valued gradient.

  • Derivatives of Paths. The gradient can be computed by differentiation of the cost f(x)f(x), which
    encodes the pathway from parameters θ\theta, through the random variable xx, to the cost value, such as the pathwise gradient, harmonic gradient estimators and finite dfferences, and Malliavin-weighted estimators.

We focus on three classes of gradient estimators: the score function, pathwise and measure-valued gradient estimators. All three estimators satisfy two desirable properties: consistent and unbiased; but they differ in their variance behaviour and in their computational cost.

Intuitive comparision

Consider the stochastic gradient problem (2) that uses Gaussian measures for three simple families
of cost functions, quadratics, exponentials and cosines:

η=θN(xμ,σ2)f(x;k)dx;θ{μ,σ};f{(xk)2,exp(kx2),cos(kx)}.\eta = \nabla_{\theta} \int \mathcal{N}(x | \mu, \sigma^2) f(x; k) \text{d}x; \quad \theta \in \{\mu, \sigma \}; \quad f \in \{ (x-k)^2, \exp(-kx^2), \cos(kx)\}.

The computational cost:

  • Both of the score-function and pathwise estimators can be computed using a single sample in the Monte Carlo estimator (N = 1), even for multivariate distributions, making them computationally cheap.

  • The measure-valued derivative estimator will require 2D evaluations of the cost function for D dimensional parameters, and for this the reason will typically not be preferred in high-dimensional settings.

  • If the cost function is not differentiable, then the pathwise gradient will not be applicable.

Several criteria to be judged when choosing
an unbiased gradient estimator:

  • computational cost;
  • implications on the use of differentiable and
    non-differentiable cost functions;
  • the change in behaviour as the cost itself changes;
  • the availability of effective variance reduction techniques to achieve low variance.

Score Function Gradient Estimators (likelihood ratio method, REINFORCE estimator)

Score function

The score function is the derivative of the log-probability of the distribution θlogp(x;θ)\nabla_{\theta} \log p(x; \theta) with respect to its parameters θ\theta:

θlogp(x;θ)=θp(x;θ)p(x;θ).\nabla_{\theta} \log p(x; \theta) = \frac{\nabla_{\theta} p(x; \theta)}{p(x; \theta)}.

properties:
1, Its expectation is zero:

Ep(x;θ)[θlogp(x;θ)]=p(x;θ)θp(x;θ)p(x;θ)dx=θp(x;θ)dx=θ1=0.\mathbb{E}_{p(x; \theta)} [\nabla_{\theta} \log p(x; \theta)] = \int p(x; \theta) \frac{\nabla_{\theta} p(x; \theta)}{p(x; \theta)} \text{d}x = \nabla_{\theta} \int p(x; \theta) \text{d}x = \nabla_{\theta} 1 = 0.

2, Its variance is the Fisher information matrix.

Using the score function, we can derive a general-purpose estimator for the sensitivity analysis of eq. (2):

η=θEp(x;θ)[f(x)]=θp(x;θ)f(x)dx=f(x)θp(x;θ)dx=p(x;θ)f(x)θlogp(x;θ)dx=Ep(x;θ)[f(x)θlogp(x;θ)].\begin{aligned} \eta &= \nabla_{\theta} \mathbb{E}_{p(x; \theta)} [f(x)] \\ &= \nabla_{\theta} \int p(x; \theta) f(x) \text{d}x \\ &= \int f(x) \nabla_{\theta} p(x; \theta) \text{d}x \\ &= \int p(x; \theta) f(x) \nabla_{\theta} \log p(x; \theta) \text{d}x \\ &= \mathbb{E}_{p(x; \theta)} [f(x) \nabla_{\theta} \log p(x; \theta)]. \end{aligned}

The form is what we need - a product of a distribution we can sample from and a function we can evaluate. Then, use the Monte Carlo estimator, we have

ηˉ=1Nn=1Nf(x^(n))θlogp(x^(n);θ),x^(n)p(x;θ).\bar{\eta} = \frac{1}{N} \sum_{n=1}^{N} f(\hat{x}^{(n)}) \nabla_{\theta} \log p(\hat{x}^{(n)}; \theta), \quad \hat{x}^{(n)} \sim p(x; \theta).

Notice that, in the third line, we have exchanged the order of the integral and the derivative. We will discuss the validity of this exchange later.

Estimator Properties

Unbiasedness

When the interchange between differentiation and integration is valid, we will obtain an
unbiased estimator of the gradient. Intuitively, since differentiation is a process of limits, the validity of the interchange will relate to the conditions for which it is possible to exchange limits and integrals, in such cases most often relying on the use of the dominated convergence theorem or the Leibniz integral rule (Flanders, 1973; Grimmett and Stirzaker, 2001). The interchange will be valid if the following conditions are satisfied:

(a) The measure p(x;θ)p(x; \theta) is continuously differentiable with respect to θ\theta;

(b) The product f(x)p(x;θ)f(x) p(x; \theta) is both integrable and differentiable for θ\theta;

© There exists an integrable function g(x)g(x) such that supθf(x)θp(x;θ)1g(x)\sup_{\theta} \|f(x) \nabla_{\theta} p(x; \theta) \|_1 \leq g(x) for x\forall x.

These assumptions usually hold in machine learning applications.

Abosolute Continuity

  • Example (Bounded support). Consider the score-function estimator for a cost f(x)=xf(x) = x and distribution p(x;θ)=1θ10<x<θp(x; \theta) = \frac{1}{\theta} 1_{0 < x < \theta}, which is differential in θ\theta when x(0,θ)x \in (0, \theta); the score function is θlogp(x;θ)=1θ\nabla_{\theta} \log p(x; \theta) = -\frac{1}{\theta}.
    The true gradient is: θEp(x;θ)[x]=θ(1θ0θx22)=12\nabla_{\theta} \mathbb{E}_{p(x;\theta)} [x] = \nabla_{\theta} (\frac{1}{\theta} \int_{0}^{\theta} \frac{x^2}{2}) = \frac{1}{2}.
    The score-funtion gradient is: Ep(x;θ)[x1θ]=θ/2θ=12\mathbb{E}_{p(x;\theta)} [x \frac{-1}{\theta}] = -\frac{\theta/2}{\theta} = -\frac{1}{2}.

    Why the score-function estimator fails to provide the correct gradient? The reason is that the measure is not absolutely continuous with respect to θ\theta at the boundary of the support.

Let’s explain the absolute continuity in detail.

θEp(x;θ)[f(x)]=θp(x;θ)f(x)dx=limh0p(x;θ+h)p(x;θ)hf(x)dx=limh01hp(x;θ)p(x;θ+h)p(x;θ)p(x;θ)f(x)dx=limh01hp(x;θ)(p(x;θ+h)p(x;θ)1)f(x)dx=limh01h(Ep(x;θ)[ω(θ,h)f(x)]Ep(x;θ)[f(x)])\begin{aligned} \nabla_{\theta} \mathbb{E}_{p(x;\theta)}[f(x)] &= \int \nabla_{\theta} p(x;\theta) f(x) \text{d}x \\ &= \int \lim_{h \to 0} \frac{p(x; \theta + h) - p(x;\theta)}{h} f(x) \text{d}x \\ &= \lim_{h \to 0} \frac{1}{h} \int p(x; \theta) \frac{p(x; \theta + h) - p(x;\theta)}{p(x;\theta)} f(x) \text{d}x\\ &= \lim_{h \to 0} \frac{1}{h} \int p(x; \theta) (\frac{p(x; \theta + h)}{p(x;\theta)}-1) f(x) \text{d}x\\ &= \lim_{h \to 0} \frac{1}{h} (\mathbb{E}_{p(x;\theta)}[\omega(\theta, h) f(x)] - \mathbb{E}_{p(x;\theta)}[f(x)]) \end{aligned}

where the ratio ω(θ,h):=p(x;θ+h)p(x;θ)\omega(\theta, h) := \frac{p(x; \theta+h)}{p(x;\theta)}. The estimator makes an implicit assumption of absolute continuity, where we require p(x;θ+h)>0p(x; \theta+h) > 0 for all points where p(x;θ)>0p(x; \theta) > 0. Not all distributions of interest satisfy this property, and failures of absolute continuity can result in a biased gradient.

Estimator Variance

Define the estimator mean as μ(θ):=Ep(x;θ)[ηˉN]\mu(\theta) := \mathbb{E}_{p(x;\theta)}[\bar{\eta}_N], for N=1N=1, The variance of the score function estimator is:

Varp(x;θ)(ηˉ1)=Ep(x;θ)[(f(x)θlogp(x;θ))2]μ(θ)2,\begin{aligned} \text{Var}_{p(x;\theta)}(\bar{\eta}_1) &= \mathbb{E}_{p(x;\theta)}[(f(x) \nabla_{\theta} \log p(x; \theta))^2] - \mu(\theta)^2, \end{aligned}

or

Varp(x;θ)(ηˉ1)=limh01hEp(x;θ)[(ω(θ,h)1)2f(x)2]μ(θ)2,suph(μ(θ+h)μ(θ))2Ep(x;θ)[ω(θ,h)1]2.\begin{aligned} \text{Var}_{p(x;\theta)}(\bar{\eta}_1) &= \lim_{h \to 0} \frac{1}{h} \mathbb{E}_{p(x;\theta)}[(\omega(\theta, h) - 1)^2f(x)^2] - \mu(\theta)^2, \\ & \geq \sup_{h} \frac{(\mu(\theta + h) - \mu(\theta))^2}{\mathbb{E}_{p(x;\theta)}[\omega(\theta, h) - 1]^2}. \end{aligned}

Three sources of variance:

  • importance ratio ω(θ,h)\omega(\theta, h) (the need for absolute continuity);

  • The dimensionality of the parameters xx;

  • The variance of the cost function f(x)f(x).

Computational Cost

The computational cost of the score function estimator is low, it is the order of O(N(D+L))O(N(D+L)) for NN samples, DD dimensional distributional parameters θ\theta and LL is the cost of evaluating the cost function.

Conclusion

  • The score funtion only need the final value of the cost in its computation and it makes no assumptions about the internal structure of the cost function. Any type of cost function can be used.

  • The measure must be differentiable with respect to the parameters θ\theta, and we can easily sample from the measure. It is applicable to both discrete and continuous distributions.

  • The estimator can be implemented using only a single sample, making it computationally efficient.

  • There are too many sources of variance, we can use some variance reduction techniques to reduce the variance.

Pathwise Gradient Estimators

Sampling Paths (sampling process)

For continuous distribution, the alternative way to generate samples x^\hat{x} from the distribution p(x;θ)p(x; \theta) is to sample from a simpler base distribution p(ϵ)p(\epsilon) which is independent of the parameters θ\theta, and then transform the samples through a deterministic path g(ϵ;θ)g(\epsilon; \theta):

x^p(x;θ)x^=g(ϵ;θ),ϵp(ϵ).\hat{x} \sim p(x; \theta) \quad \equiv \quad \hat{x} = g(\epsilon; \theta), \quad \epsilon \sim p(\epsilon).

This transformation is described by the rule for the change of variables for probability:

p(x;θ)=p(ϵ)ϵg(ϵ;θ)1.p(x; \theta) = p(\epsilon) |\nabla_{\epsilon} g(\epsilon;\theta)|^{-1}.

One-liners [Reparameterization Trick]

One whidely-known example is sampling from a multivariate Gaussian distribution p(x;θ)=N(xμ,Σ)p(\bf x; \bf \theta) = \mathcal{N}(\mathbf x|\mathbf \mu, \mathbf \Sigma):

  1. First sample from a standard Gaussian distribution p(ϵ)=N(ϵ0,I)p(\mathbf \epsilon) = \mathcal{N}(\mathbf \epsilon|\mathbf 0, \mathbf I);
  2. Then transform the samples through the local-scale transformation g(ϵ;θ)=μ+Lϵg(\epsilon; \theta) = \mu + \mathbf L \epsilon, where LLT=Σ\mathbf{LL}^T = \mathbf \Sigma.

x^=μ+Lϵ,ϵN(0,I),LLT=Σ.\hat{x} = \mu + \mathbf L \epsilon, \quad \epsilon \sim \mathcal{N}(\mathbf 0, \mathbf I), \quad \mathbf L \mathbf L^T = \mathbf \Sigma.

Many such transformations exist for common distributions, including Dirichlet, Gamma, and many others. These types of transformations are called one-liners because they can be implemented in a single line of code.

The expectation of eq. (1) is then:

Ep(x;θ)[f(x)]=Ep(ϵ)[f(g(ϵ;θ))]\begin{aligned} \mathbb{E}_{p(x; \theta)} [f(x)] = \mathbb{E}_{p(\epsilon)} [f(g(\epsilon; \theta))] \end{aligned}

It is often used in Monte Carlo methods, and is called reparameterisation trick.

Gradient Estimators

Assume that we have a distribution p(x;θ)p(x; \theta) with known differentiable sampling path g(ϵ;θ)g(\epsilon; \theta) and base distribution p(ϵ)p(\epsilon). The gradient estimator for the sensitivity analysis of eq. (2) is:

η=θEp(x;θ)[f(x)]=θp(ϵ)f(g(ϵ;θ))dϵ=Ep(ϵ)[θf(g(ϵ;θ))]ηˉ=1Nn=1Nθf(g(ϵ^(n);θ)),ϵ^(n)p(ϵ).(3)\begin{aligned} \eta &= \nabla_{\theta} \mathbb{E}_{p(x; \theta)} [f(x)] \\ &= \nabla_{\theta} \int p(\epsilon) f(g(\epsilon; \theta)) \text{d}\epsilon \\ &= \mathbb{E}_{p(\epsilon)}[\nabla_{\theta} f(g(\epsilon; \theta))]\\ \bar{\eta} &= \frac{1}{N} \sum_{n=1}^{N} \nabla_{\theta} f(g(\hat{\epsilon}^{(n)}; \theta)), \quad \hat{\epsilon}^{(n)} \sim p(\epsilon). \qquad (3) \end{aligned}

Decoupling Sampling and Gradient Computation

The pathwise estimator (3) is limited to those distributions for which we simultaneously have a differential path and use this same path to generate samples. But, sampling from a distribution may not provide a differentiable path. Thus, we can expand the applicability of the pathwise gradient by decoupling these two processes.

The pathwise estimator can be rewritten in a more general form:

η=θEp(x;θ)[f(x)]=Ep(ϵ)[θf(x)x=g(ϵ;θ)]=p(ϵ)xf(x)x=g(ϵ;θ)θg(ϵ;θ)dϵ=p(x;θ)xf(x)θxdx=Ep(x;θ)[xf(x)θx].\begin{aligned} \eta &= \nabla_{\theta} \mathbb{E}_{p(\mathbf x;\theta)}[f(\mathbf x)] \\ &= \mathbb{E}_{p(\epsilon)}[\nabla_{\theta} f(\mathbf x)|_{\mathbf x = g(\epsilon;\theta)}] \\ &= \int p(\epsilon) \nabla_{\mathbf x} f(\mathbf x)|_{\mathbf x = g(\epsilon;\theta)} \nabla_{\theta} g(\epsilon; \theta) \text{d} \epsilon \\ &= \int p(\mathbf x;\theta) \nabla_{\mathbf x} f(\mathbf x)\nabla_{\theta} \mathbf x \text{d} \mathbf x \\ &= \mathbb{E}_{p(\mathbf x;\theta)}[\nabla_{\mathbf x} f(\mathbf x) \nabla_{\theta} \mathbf x]. \end{aligned}

One way to compute θx\nabla_{\theta} \mathbf x is to use θg(ϵ;θ)\nabla_{\theta} g(\epsilon; \theta), but this form is not always convenient.Another way to compute θx\nabla_{\theta} \mathbf x is to use the inverse of the path g1(x;θ)g^{-1}(x; \theta). g1(x;θ)g^{-1}(x; \theta) can be thought as the ‘standardisation path’ of the random variable-- that is the transformation that removes the dependence of the sampling on the distribution parameters, standardising it to a zero mean unit variance-like form.

Consider the equation ϵ=g1(x;θ)\epsilon = g^{-1}(x; \theta), evaluating the total derivative on both sides:

θϵ=θg1(x;θ)0=xg1(x;θ)θx+θg1(x;θ)thus,θx=(xg1(x;θ))1θg1(x;θ).\begin{aligned} \nabla_{\theta} \epsilon &= \nabla_{\theta} g^{-1}(x;\theta) \\ 0 &= \nabla_{x} g^{-1}(x;\theta) \nabla_{\theta} x + \nabla_{\theta} g^{-1}(x;\theta) \\ \text{thus,} \nabla_{\theta} x &= - (\nabla_{x} g^{-1}(x;\theta))^{-1} \nabla_{\theta} g^{-1}(x; \theta). \end{aligned}

In this form, we can apply pathwise gradient estimator to a far wider set of distributions and paths, such as for the Beta, Gamma, and Dirichlet distributions.

  • Example (Univariate distributions). For univariate distribution p(x;θ)p(x;\theta), we can use the sampling path given by the inverse CDF: x=g(ϵ;θ)=F1(ϵ;θ)x = g(\epsilon;\theta) = F^{-1}(\epsilon; \theta), where ϵU[0,1]\epsilon \sim \mathcal{U}[0, 1]. Computing the derivative θx=θF1(ϵ;θ)\nabla_{\theta} x = \nabla_{\theta} F^{-1} (\epsilon; \theta) is often complicated and expensive. We can obtain an alternative expression by considering the inverse path g1(x;θ)=F(x;θ)g^{-1}(x;\theta) = F(x; \theta), we have:

    θx=θF(x;θ)xF(x;θ)=θF(x;θ)p(x;θ).\nabla_{\theta} x = -\frac{\nabla_{\theta} F(x;\theta)}{\nabla_x F(x; \theta)} = - \frac{\nabla_{\theta} F(x;\theta)}{p(x; \theta)}.

Bias and variance

When deriving the pathwise estimator, we exploited an interchange between differentiation and integration. If this interchange is valid, then the estimator is unbiased.

The variance of the pathwise estimator can be shown to be bounded by the squared Lipschitz constant of the cost function f(x)f(x). (1) The variance bounds that exist are independent of the dimensionality of the parameters θ\theta, which means we can get low-variance gradient estimates, even in high-dimensional space. (2) As the cost funtion becomes highly-variable, i.e., the Lipschitz constant increases, the variance of the pathwise estimator can be higher than that of the score function methods.

The pathwise estimator will not always have lower variance when compared to other methods since the variance is bounded by the Lipschitz constant of the cost function.

Computational cost

The pathwise gradient estimator is restricted to differentiable cost functions, which is a limitation when compared to the score function estimator. Rapid convergence can be obtained even when using only a single sample to compute the gradient, as is often done in practice. There is a trade-off between the number of samples used and the Lipschitz constant of the cost function, and may require more samples to be used for functions with higher Lipschitz constants. This consideration is why we will find that regularisation that promotes smoothness of the functions we learn is important for successful applications.

The computational cost of the pathwise estimator is the same as the score function estimator and is of the order O(N(D+L))O(N(D+L)) for NN samples, DD dimensional distributional parameters θ\theta and LL is the cost of evaluating the cost function and its gradient.

Conclusion

  • The pathwise estimator is only applicable to differentiable cost functions.
  • When using the pathwise estimator, we do not need to know the measure p(x;θ)p(x; \theta), but we need to know the deterministic and differentiable sampling path g(ϵ;θ)g(\epsilon; \theta) and the base distribution p(ϵ)p(\epsilon).
  • The estimator can be implemented using only a single sample, making it computationally efficient.
  • We might need to control the smoothness of the cost function to ensure that the variance of the estimator is low, and may need to employ variance reduction techniques.

Gumbel softmax estimator

Measure-Valued Gradient Estimators

Weak Derivatives (measure-valued derivatives)

Consider the derivative of a density function p(x;θ)p(x; \theta) with respect to a single parameter θi\theta_i, with ii the index on the set of distributional parameters. The derivative θip(x;θ)\nabla_{\theta_i} p(x;\theta) is not a density, since it may have negative values and does not integrate to one. Using the properties of signed measures, we can always decompose this derivative into a difference of two densities, each multiplied by a constant:

θip(x;θ)=cθi+p+(x;θ)cθip(x;θ),\nabla_{\theta_i} p(x;\theta) = c_{\theta_i}^+ p^+(x;\theta) - c_{\theta_i}^- p^-(x;\theta),

where p+,pp^+, p^- are densities, referred to as the positive and negative parts of pp. By integrating both sides of the equation, we can obtain the constants cθi+,cθic_{\theta_i}^+, c_{\theta_i}^-:

θip(x;θ)dx=θip(x;θ)dx=0;cθi+p+(x;θ)cθip(x;θ)dx=cθi+cθi.\begin{aligned} &\int \nabla_{\theta_i} p(x;\theta) \text{d}x = \nabla_{\theta_i} \int p(x;\theta) \text{d}x = 0; \\ &\int c_{\theta_i}^+ p^+(x;\theta) - c_{\theta_i}^- p^-(x;\theta) \text{d}x = c_{\theta_i}^+ - c_{\theta_i}^- . \end{aligned}

Thus, we have:

cθi+=cθi:=cθic_{\theta_i}^+ = c_{\theta_i}^- := c_{\theta_i}

The decomposition of the derivative becomes:

θip(x;θ)=cθi(p+(x;θ)p(x;θ)).\nabla_{\theta_i} p(x;\theta) = c_{\theta_i} (p^+(x;\theta) - p^-(x;\theta)).

The triple (cθi,p+,p)(c_{\theta_i}, p^+, p^-) is called the i-th weak derivative of pp with respect to θi\theta_i.

For multivariate parameters θ\theta, each dimension has one triple.

  • The derivative is weak because we do not require the density to be differentiable.
  • The weak derivative is not unique, but always exists and can be obtained using the Hahn-Jordan decomposition of a signed measure into two measures that have complementary support. see signed measure.

Deriving the estimator

For D-dimensional parameters θ\theta, we can write the gradient of the expectation as:

ηi=θiEp(x;θ)[f(x)]=θip(x;θ)f(x)dx=θip(x;θ)f(x)dx=cθi(pi+(x;θ)pi(x;θ))f(x)dx=cθi(Epi+(x;θ)[f(x)]Epi(x;θ)[f(x)]).ηˉi,N=cθiN(n=1Nf(x^+(n))n=1Nf(x^(n))),x^+(n)pi+(x;θ),x^(n)pi(x;θ).\begin{aligned} \eta_i &= \nabla_{\theta_i} \mathbb{E}_{p(x;\theta)}[f(x)] = \nabla_{\theta_i} \int p(x;\theta) f(x) \text{d}x \\ &= \int \nabla_{\theta_i} p(x;\theta) f(x) \text{d}x \\ &= \int c_{\theta_i} (p_i^+(x;\theta) - p_i^-(x;\theta)) f(x) \text{d}x \\ &= c_{\theta_i} (\mathbb{E}_{p_i^+(x;\theta)}[f(x)] - \mathbb{E}_{p_i^-(x;\theta)}[f(x)]). \\ \bar{\eta}_{i, N} &= \frac{c_{\theta_i}}{N} (\sum_{n=1}^{N} f(\hat{x}^{+(n)}) - \sum_{n=1}^{N} f(\hat{x}^{-(n)})), \quad \hat{x}^{+(n)} \sim p_i^+(x;\theta), \quad \hat{x}^{-(n)} \sim p_i^-(x;\theta). \end{aligned}

The positive and negative components may be different depending on which parameter of the measure the derivative is taken with respect to. The constant cθic_{\theta_i} will also change depending the parameter being differentiated.

  • Example (Bernoulli measure-valued gradient). Consider the Bernoulli distribution p(x;θ)=θx(1θ)1xp(x;\theta) = \theta^x (1-\theta)^{1-x}, with x{0,1}x \in \{0, 1\} and θ[0,1]\theta \in [0, 1].
    By taking the derivative with respect to θ\theta, we have:

    θp(x;θ)f(x)dx=θ(θf(1)+(1θ)f(0))=f(1)f(0).\begin{aligned} \nabla_{\theta} \int p(x;\theta) f(x) \text{d}x &= \nabla_{\theta} (\theta f(1) + (1-\theta) f(0))\\ &= f(1) - f(0). \end{aligned}

    By weak derivative, we have:

θp(x;θ)f(x)dx=θp(x;θ)f(x)dx=δ1f(x)δ0f(x)dx=f(1)f(0).\begin{aligned} \nabla_{\theta} \int p(x;\theta) f(x) \text{d}x &= \int \nabla_{\theta} p(x;\theta) f(x) \text{d}x \\ &= \int \delta_1 f(x) - \delta_0 f(x) \text{d}x \\ &= f(1) - f(0). \end{aligned}

which is the same as the original gradient.

Vector case

If the measure is a factorised distribution p(x;θ)=dp(xdθd)p(x;\theta) = \prod_d p(x_d | \theta_d), then the positive component and negative component of the weak derivative will itself factorise across the dimensions. For the positive component, this decomposition will be pi+(x;θ)=p(xi)pi+(xi;θi)p^+_i(x;\theta) = p(x_{-i})p^+_i(x_i;\theta_i), which is the product of the marginal distribution p(xi)p(x_{-i}) and the positive component of the weak derivative with respect to θi\theta_i. The negative component will be pi(x;θ)=p(xi)pi(xi;θi)p^-_i(x;\theta) = p(x_{-i})p^-_i(x_i;\theta_i), which is the product of the marginal distribution p(xi)p(x_{-i}) and the negative component of the weak derivative with respect to θi\theta_i.

Estimator Properties

Domination

Remember in the score function estimator, we need the measure to be absolutely continuous with respect to θ\theta. We explored one example where we were unable to ensure domination, because no bounding constant applies at the boundaries of the domain. For weak derivatives, we can always ensure the correctness of the interchange between differentiation and integration: the fundamental property of weak derivatives states that if the triple (c,p+,p)(c, p^+, p^-) is the weak derivative of p(x;θ)p(x;\theta), then for every bounded continuous function f(x)f(x), we have:

θf(x)p(x;θ)dx=cθ[f(x)p+(x;θ)dxf(x)p(x;θ)dx].\nabla_{\theta} \int f(x) p(x;\theta) \text{d}x = c_{\theta} [\int f(x) p^+(x;\theta) \text{d}x - \int f(x) p^-(x;\theta) \text{d}x].

  • Example (Bounded support). Consider the measure-valued estimator for a cost function f(x)=xf(x) = x and distribution p(x;θ)=1θ1{0<x<θ}p(x; \theta) = \frac{1}{\theta} 1_{\{0 < x < \theta\}}, which is differential in θ\theta when x(0,θ)x \in (0, \theta); The measure-valued derivative is

    θf(x)U[0,θ](x)dx=θ(1θ0θf(x)dx)=1θf(θ)1θ20θf(x)dx=1θ(f(x)δθ(x)dxf(x)U[0,θ](x)dx)\begin{aligned} \nabla_{\theta} \int f(x) \mathcal{U}_{[0, \theta]}(x) \text{d}x &= \nabla_{\theta} (\frac{1}{\theta} \int_{0}^{\theta} f(x) \text{d}x) = \frac{1}{\theta} f(\theta) - \frac{1}{\theta^2} \int_{0}^{\theta} f(x) \text{d} x \\ &= \frac{1}{\theta} (\int f(x) \delta_{\theta}(x) \text{d}x - \int f(x) \mathcal{U}_{[0, \theta]}(x) \text{d}x) \\ \end{aligned}

The measure-valued derivative is given by the triple (1θ,δθ,U[0,θ])(\frac{1}{\theta}, \delta_{\theta}, \mathcal{U}_{[0, \theta]}). For specific cost function f(x)=xf(x) = x, we have:

The true gradient is: θEp(x;θ)[x]=θ(1θ0θx22)=12\nabla_{\theta} \mathbb{E}_{p(x;\theta)} [x] = \nabla_{\theta} (\frac{1}{\theta} \int_{0}^{\theta} \frac{x^2}{2}) = \frac{1}{2}.
The measure-valued gradient is: 1θ(Eδθ[x]EU[0,θ][x])=1θ(θθ2)=12.\frac{1}{\theta} (\mathbb{E}_{\delta_{\theta}} [x] - \mathbb{E}_{\mathcal{U}_{[0, \theta]}[x]}) = \frac{1}{\theta} (\theta - \frac{\theta}{2}) = \frac{1}{2}.

Bias and variance

For bounded and continuous cost functions ff, by using the fundamental property of weak derivatives, the measure-valued gradient estimator is unbiased.

The variance of the measure-valued gradient estimator is:

Varp(x;θ)[ηˉ1]=Varp+(x;θ)[f(x)]+Varp(x;θ)[f(x)]2Covp+(x;θ)p(x;θ)[f(x),f(x)].\begin{aligned} \text{Var}_{p(x;\theta)}[\bar{\eta}_1] = \text{Var}_{p^+(x;\theta)}[f(x)] + \text{Var}_{p^-(x;\theta)}[f(x)] - 2 \text{Cov}_{p^+(x';\theta)p^-(x;\theta)} [f(x'), f(x)]. \end{aligned}

  • The variance depends on the choice of decomposition of the weak derivative into positive and negative components.
  • If the random variables can be ‘coupled’ in some way, where they share the same underlying source of randomness, this will reduce the variance of the gradient estimator by increasing the covariance term. The most common way is to sample the variables xx' and xx using common random numbers. Another way is to use variance reduction techniques.
  • Figure 5 shows that the measure valued estimator is not sensitive to the dimensionality of the parameters θ\theta. It is however sensitive to the magnitude of the function.

Computational cost

Measure-valued gradients are much more computationally expensive than the score-function or pathwise gradients. This is because the gradient we computed is the gradient for a single parameter: for every parameter we require two evaluations of the cost function to compute its gradient.
It is this structure of adapting the underlying sampling distributions for each parameter that leads
to the low variance of the estimator but at the same time makes its application to high-dimensional
parameter spaces prohibitive.

The computational cost of the measure-valued gradient estimator is the order of O(2NDL)O(2NDL) for NN samples, DD dimensional distributional parameters θ\theta and LL is the cost of evaluating the cost function.

Conclusion

  • The measure-valued estimator can be used with any type of cost function, differentiable or not. As long as we can evaluate the cost function repeatedly for different inputs.
  • It is applicable to both discrete and continuous distributions.
  • Computationally expensice in high-dimensional parameter spaces.
  • We need methods to sample from the positive and negative measures.
  • Using the weak derivative requires manual derivation of the decomposition at first, although for many common distributions the weak-derivative decompositions are known.

Variance Reduction Techniques

The gradient variance is one of the principal sources of performance issues. This paper introduces four common methods to reduce the variance of gradient estimators: large-samples, coupling, conditioning, and control variates.

Large-Samples

The simplest way to reduce the variance of the gradient estimator is to use more samples. The variance of an estimators will shrinks as O(1N)O(\frac{1}{N}), where NN is the number of samples. However, the computational cost will increase linearly with the number of samples. The computational cost can be reduces by parallelising the computation of the gradient across multiple processors. Sometimes, increasing the number of Monte Carlo samples will not be an option, such as when the cost function involves a real-world experiment or interaction with a user.

Coupling and Common random numbers

When consider the difference between two expectations of a function f(x)f(x) under different but closely-related distributions p1(x)p_1(x) and p2(x)p_2(x):

η=Ep1(x)[f(x)]Ep2(x)[f(x)]\begin{aligned} \eta = \mathbb{E}_{p_1(x)}[f(x)] - \mathbb{E}_{p_2(x)}[f(x)] \end{aligned}

The direct method to compute the difference is to estimate each expectation separately using Monte Carlo sampling, and then compute the difference between the two estimates:

ηˉind=1Nn=1Nf(x^1(n))1Nn=1Nf(x^2(n)),\begin{aligned} \bar{\eta}_{ind} = \frac{1}{N} \sum_{n=1}^{N} f(\hat{x}_1^{(n)}) - \frac{1}{N} \sum_{n=1}^{N} f(\hat{x}_2^{(n)}), \end{aligned}

where x^1(n)p1(x)\hat{x}_1^{(n)} \sim p_1(x) and x^2(n)p2(x)\hat{x}_2^{(n)} \sim p_2(x).

We can achieve a simple form of variance reduction by coupling x^1(n)\hat{x}_1^{(n)} and x^2(n)\hat{x}_2^{(n)}, so that each pair (x^1(n),x^2(n))(\hat{x}_1^{(n)}, \hat{x}_2^{(n)}) is sampled from some joint distribution p12(x1,x2)p_{12}(x_1, x_2) with marginals p1(x)p_1(x) and p2(x)p_2(x). The variance of the coupled estimator is:

Varp12(x1,x2)[ηˉcpl]=Varp12(x1,x2)[f(x1)f(x2)]=Varp1(x1)[f(x1)]+Varp2(x2)[f(x2)]2Covp12(x1,x2)[f(x1),f(x2)]=Varp1(x1)p2(x2)[ηˉind]2Covp1(x1)[f(x1),f(x2)].\begin{aligned} \text{Var}_{p_12(x_1, x_2)}[\bar{\eta}_{cpl}] &= \text{Var}_{p_12(x_1, x_2)}[f(x_1) - f(x_2)] \\ &= \text{Var}_{p_1(x_1)}[f(x_1)] + \text{Var}_{p_2(x_2)}[f(x_2)] - 2 \text{Cov}_{p_12(x_1, x_2)}[f(x_1), f(x_2)] \\ &= \text{Var}_{p_1(x_1)p_2(x_2)}[\bar{\eta}_{ind}] - 2 \text{Cov}_{p_1(x_1)}[f(x_1), f(x_2)]. \end{aligned}

Thus, to reduce the variance we need to choose a coupling p12(x1,x2)p_{12}(x_1, x_2) such that f(x1)f(x_1) and f(x2)f(x_2) are positively correlated. The most common way to achieve this is to use common random numbers when p1(x1)p_1(x_1) and p2(x2)p_2(x_2) are close or in a related family of distributions. This means that the random numbers used to generate x1x_1 and x2x_2 are the same. For example, in the univariate case, we can first sample uU[0,1]u \sim \mathcal{U}[0,1] and then apply the inverse CDF transformation to obtain x1=Fp11(u)x_1 = F_{p_1}^{-1}(u) and x2=Fp21(u)x_2 = F_{p_2}^{-1}(u).

  • Coupling may not always reduce the variance of the measure valued estimator, depending on the cost function.

Conditioning

Rao-Blackwellisation is a variance reduction technique that probabilistically conditions our estimator on a subset of dimensions and integrates out the remaining dimensions.

Assume that the dimensions {1,...,D}\{1,..., D\} of xx are partitioned into a set of dimensions S\mathcal{S} and its complement Sc=1,...,D S\mathcal{S}^c = {1,...,D}\ \mathcal{S}. The expectation g(xSc)=Ep(xS)[f(x)xSc]g(x_{\mathcal{S}^c}) = \mathbb{E}_{p(x_{\mathcal{S}})}[f(x)|x_{\mathcal{S}^c}]. We can estimate Ep(x)[f(x)]\mathbb{E}_{p(x)}[f(x)] by performing Monte Carlo integration over the dimensions Sc\mathcal{S}^c:

gˉN=1Nn=1Ng(x^Sc(n))=1Nn=1NEp(xS)[f(x)x^Sc(n)],x^Sc(n)p(xSc).\begin{aligned} \bar{g}_{N} &= \frac{1}{N} \sum_{n=1}^{N} g(\hat{x}_{\mathcal{S}^c}^{(n)}) \\ &= \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{p(x_{\mathcal{S}})}[f(x)|\hat{x}_{\mathcal{S}^c}^{(n)}], \quad \hat{x}_{\mathcal{S}^c}^{(n)} \sim p(x_{\mathcal{S}^c}). \end{aligned}

By law of total expectation Var(Y)=E[Var(YX)]+Var(E[YX])\text{Var}(Y) = \mathbb{E}[\text{Var}(Y | X)] + \text{Var}(\mathbb{E}[Y | X]), we have:

Varp(x)[f(x)]=Ep(xSc)[Varp(xS)[f(x)xSc]]+Varp(xSc)[Ep(xS)[f(x)xSc]]=Ep(xSc)[Varp(xS)[f(x)xSc]]+Varp(xSc)[g(xSc)]Varp(xSc)[g(xSc)]\begin{aligned} \text{Var}_{p(x)}[f(x)] &= \mathbb{E}_{p(x_{\mathcal{S}^c})}[\text{Var}_{p(x_{\mathcal{S}})}[f(x)|x_{\mathcal{S}^c}]] + \text{Var}_{p(x_{S^c})}[\mathbb{E}_{p(x_{\mathcal{S}})}[f(x)|x_{\mathcal{S}^c}]] \\ &= \mathbb{E}_{p(x_{\mathcal{S}^c})}[\text{Var}_{p(x_{\mathcal{S}})}[f(x)|x_{\mathcal{S}^c}]] + \text{Var}_{p(x_{\mathcal{S}^c})}[g(x_{\mathcal{S}^c})] \\ & \geq \text{Var}_{p(x_{\mathcal{S}^c})}[g(x_{\mathcal{S}^c})] \end{aligned}

Thus, for unconditional one fˉN=1Nn=1Nf(x^(n))\bar{f}_N = \frac{1}{N}\sum_{n=1}^{N} f(\hat{x}^{(n)}), we have:

Varp(x)[fˉN]=1NVarp(x)[f(x)]1NVarp(xSc)[g(xSc)]=Varp(xSc)[gˉN].\begin{aligned} \text{Var}_{p(x)}[\bar{f}_N] = \frac{1}{N} \text{Var}_{p(x)}[f(x)] \geq \frac{1}{N} \text{Var}_{p(x_{\mathcal{S}^c})}[g(x_{\mathcal{S}^c})] = \text{Var}_{p(x_{\mathcal{S}^c})}[\bar{g}_{N}]. \end{aligned}

  • Conditional estimator has lower variance than the unconditional estimator.
  • This technique is useful in practice only if we can compute the conditional expectation g(xSc)g(x_{\mathcal{S}^c}) efficiently.

Control Variates

Since all the gradient estimators have the same form Ep(x;θ)[f(x)]\mathbb{E}_{p(x;\theta)}[f(x)], we will focus on this general form. The strategy is to replace the function f(x)f(x) in the expectation with a substitute function f~(x)\tilde{f}(x) whose expectation is the same as Ep(x;θ)[f(x)]\mathbb{E}_{p(x;\theta)}[f(x)], but whose variance is lower.

If we have a function h(x)h(x) with a known expectation Ep(x;θ)[h(x)]\mathbb{E}_{p(x;\theta)}[h(x)], then we can construct a new function

f~(x)=f(x)β(h(x)Ep(x;θ)[h(x)]).\tilde{f}(x) = f(x) - \beta(h(x)-\mathbb{E}_{p(x;\theta)}[h(x)]).

Here h(x)h(x) is a control variate. β\beta is a coefficient that affects the strength of the control variate.
Then we can get a control variate estimator:

ηˉN=1Nn=1Nf~(x^(n))=fˉβ(hˉEp(x;θ)[h(x)]).\begin{aligned} \bar{\eta}_{N} &= \frac{1}{N} \sum_{n=1}^{N} \tilde{f}(\hat{x}^{(n)}) \\ &= \bar{f} - \beta (\bar{h} - \mathbb{E}_{p(x;\theta)}[h(x)]). \end{aligned}

Bias, consistency and variance

(1) Unbiasedness. The control variate estimator is unbiased:

Ep(x;θ)[ηˉN]=Ep(x;θ)[fˉβ(hˉEp(x;θ)[h(x)])]=Ep(x;θ)[fˉ]=Ep(x;θ)[f(x)].\begin{aligned} \mathbb{E}_{p(x;\theta)}[\bar{\eta}_{N}] &= \mathbb{E}_{p(x;\theta)}[\bar{f} - \beta (\bar{h} - \mathbb{E}_{p(x;\theta)}[h(x)]) ]\\ &= \mathbb{E}_{p(x;\theta)}[\bar{f}] \\ &= \mathbb{E}_{p(x;\theta)}[f(x)]. \end{aligned}

(2) Consistency. The control variate estimator is consistent:

limNηˉN=Ep(x;θ)[f~(x)]=Ep(x;θ)[f(x)].\lim_{N \to \infty} \bar{\eta}_{N} = \mathbb{E}_{p(x;\theta)}[\tilde{f}(x)] = \mathbb{E}_{p(x;\theta)}[f(x)].

(3) Variance. The variance of the control variate estimator is (N=1):

Varp(x;θ)[f~]=Varp(x;θ)[fβ(hEp(x;θ)[h(x)])]=Varp(x;θ)[f]+β2Varp(x;θ)[h]2βCovp(x;θ)[f,h].\begin{aligned} \text{Var}_{p(x;\theta)}[\tilde{f}] &= \text{Var}_{p(x;\theta)}[f - \beta (h - \mathbb{E}_{p(x;\theta)}[h(x)]) ]\\ &= \text{Var}_{p(x;\theta)}[f] + \beta^2 \text{Var}_{p(x;\theta)}[h] - 2 \beta \text{Cov}_{p(x;\theta)}[f, h]. \end{aligned}

By minimising the right-hand side of the equation with respect to β\beta, we can obtain the optimal value of β\beta:

β=Covp(x;θ)[f,h]Varp(x;θ)[h]=Varp(x;θ)[f]Varp(x;θ)[h]Corr(f,h).\beta^* = \frac{\text{Cov}_{p(x;\theta)}[f, h]}{\text{Var}_{p(x;\theta)}[h]} = \sqrt{\frac{\text{Var}_{p(x;\theta)}[f]}{\text{Var}_{p(x;\theta)}[h]}} \text{Corr}(f, h).

Using the optimal value of β\beta, the potential variance reduction is:

Varp(x;θ)[f~]Varp(x;θ)[f]=Varp(x;θ)[fβ(hEp(x;θ)[h(x)])]Varp(x;θ)[f]=1Corr(f,h)21.\begin{aligned} \frac{\text{Var}_{p(x;\theta)}[\tilde{f}]}{\text{Var}_{p(x;\theta)}[f]} = \frac{\text{Var}_{p(x;\theta)}[f - \beta (h - \mathbb{E}_{p(x;\theta)}[h(x)])]}{\text{Var}_{p(x;\theta)}[f]} = 1 - \text{Corr}(f, h)^2 \leq 1. \end{aligned}

  • The stronger the correlation between ff and hh, the greater the potential variance reduction.
  • In practice, the optimal β\beta^* will not be known and it can be estimated using the same NN samples. But the samples used to estimate hˉ\bar{h} will introduce a bias because βˉN\bar{\beta}_N and hˉ\bar{h} will no longer be independent. In practice, thi bias is often negligible and can be controlled since it decreases quickly as the number of samples NN increases.

Multiple and Non-linear Controls

Designing Control Variates

(1) Baselines. One simple way to reduce the variance of a score-function gradient estimator is to use the score function itself as a control variate, since its expectation under the measure is zero. The modified estimator is:

ηˉN=1Nn=1N(f(x^(n))β)θlogp(x^n;θ),x^(n)p(x;θ)\begin{aligned} \bar{\eta}_{N} &= \frac{1}{N} \sum_{n=1}^{N} (f(\hat{x}^{(n)}) - \beta)\nabla_{\theta} \log p(\hat{x}^n;\theta), \hat{x}^{(n)} \sim p(x;\theta) \\ \end{aligned}

In reinforcement learning, β\beta is called a baseline and it can be estimated with a running average of the cost. While this approach is easier to implement than
optimising β\beta to minimise variance, it is not optimal and does not guarantee lower variance compared to the vanilla score-function estimator.

(2) Bounds. We can use bounds on the cost function ff as ways of specifying the form of the control variate hh. This is intuitive because it maintains a correlation between ff and hh, and if chosen well, may be easily integrable against the measure and available in closed form. This approach
requires more knowledge of the cost function, since we will need to characterise the cost analytically in some way to bound it. In general, unless the bounds used are tight, they will not be effective as control variates, since the gap between the bound and the true function is not controllable and will not necessarily give the information needed for variance reduction.

(3) Delta method. The delta method is a way of constructing a control variate by using the Taylor expansion of the cost function. This requires a cost function that is differentiable so that we can compute the
second-order Taylor expansion, but can be an effective and very general approach for variance reduction that allows easy implementation. It can be used for variance reduction in both the score-function estimator (Paisley et al., 2012) and the pathwise estimator(Miller et al., 2017).

  • Example. Define γ(x)\gamma(x) is the gradient of the cost function f(x)f(x), H(x)H(x) is the Hessian of the cost function f(x)f(x). The second-order Taylor expansion of a cost function expand around point μ\mu and its derivate are

    h(x)=f(μ)+(xμ)Tγ(μ)+12(xμ)TH(μ)(xμ),xh(x)=γ(μ)T+(xμ)TH(μ).\begin{aligned} h(x) &= f(\mu) + (x - \mu)^T \gamma(\mu) + \frac{1}{2} (x - \mu)^T H(\mu) (x - \mu), \\ \nabla_{x} h(x) &= \gamma(\mu)^T + (x - \mu)^T H(\mu). \end{aligned}

    We can use this expansion directly as a control variate for the score-function estimator:

ηˉSF=θEp(x;θ)[f(x)]=θEp(x;θ)[f(x)βTh(x)]+βTθEp(x;θ)[h(x)]=Ep(x;θ)[(f(x)βTh(x))θlogp(x;θ)]+βTθEp(x;θ)[h(x)].\begin{aligned} \bar{\eta}_{SF} &= \nabla_{\theta} \mathbb{E}_{p(x;\theta)}[f(x)] \\ &= \nabla_{\theta} \mathbb{E}_{p(x;\theta)}[f(x) - \beta^T h(x)] + \beta^T \nabla_{\theta} \mathbb{E}_{p(x;\theta)}[h(x)] \\ &= \mathbb{E}_{p(x;\theta)}[(f(x) - \beta^T h(x))\nabla_{\theta} \log p(x;\theta)] + \beta^T \nabla_{\theta} \mathbb{E}_{p(x;\theta)}[h(x)]. \end{aligned}

In the Gaussian mean-field variational inference that Paisley et al. (2012) consider, the second term is known in closed-form and hence does not require Monte Carlo approximation. β\beta is a multivariate control coefficient and is estimated separately.

For the pathwise estimator, using the sampling path x=g(ϵ;θ)x = g(\epsilon; \theta), we have:

ηˉPW=θEp(x;θ)[f(x)]=θEp(x;θ)[f(x)βTh(x)]+βTθEp(x;θ)[h(x)]=θEp(x;θ)[f(g(ϵ;θ))βTh(g(ϵ;θ))]+βTθEp(x;θ)[h(x)]=Ep(ϵ)[xf(x)θg(ϵ;θ)βxh(x)θg(ϵ;θ)]+βTθEp(x;θ)[h(x)].\begin{aligned} \bar{\eta}_{PW} &= \nabla_{\theta} \mathbb{E}_{p(x;\theta)}[f(x)] \\ &= \nabla_{\theta} \mathbb{E}_{p(x;\theta)}[f(x) - \beta^T h(x)] + \beta^T \nabla_{\theta} \mathbb{E}_{p(x;\theta)}[h(x)] \\ &= \nabla_{\theta} \mathbb{E}_{p(x;\theta)}[f(g(\epsilon; \theta)) - \beta^T h(g(\epsilon; \theta))] + \beta^T \nabla_{\theta} \mathbb{E}_{p(x;\theta)}[h(x)] \\ &= \mathbb{E}_{p(\epsilon)}[\nabla_{x}f(x) \nabla_{\theta} g(\epsilon; \theta) - \beta \nabla_x h(x)\nabla_{\theta} g(\epsilon; \theta)] + \beta^T \nabla_{\theta} \mathbb{E}_{p(x;\theta)}[h(x)]. \end{aligned}

Assume that the final term is known in closed-form and does not require stochastic approximation.

Guidance in Choosing Gradient Estimators

The authors provide some guidance in choosing gradient estimators.

  • If our estimation problem involves continuous functions and measures that are continuous in the domain, then using the pathwise estimator is a good default. It is relatively easy to implement and its default implementation, without additional variance reduction, will typically have variance that is low enough so as not to interfere with the optimisation.

  • If the cost function is not differentiable or is a black-box function then the score-function or the measure-valued gradients are available. If the number of parameters is low, then the measure-valued gradient will typically have lower variance and would be preferred. But if we have a high-dimensional parameter set, then the score-function estimator should be used.

  • If we have no control over the number of times we can evaluate a black-box cost function, effectively only allowing a single evaluation of it, then the score function is the only estimator of the three we reviewed that is applicable.

  • The score-function estimator should, by default, always be implemented with at least some basic variance reduction. The simplest option is to use a baseline control variate estimated with a running average of the cost value.
    • When using the score-function estimator, some attention should be paid to the dynamic range of the cost function and its variance, and ways found to keep its value bounded within a reasonable range, e.g. by transforming the cost so that it is zero mean, or using a baseline.

  • For all estimators, track the variance of the gradients if possible and address high variance by using a larger number of samples from the measure, decreasing the learning rate, or clipping the gradient values. It may also be useful to restrict the range of some parameters to avoid extreme values, e.g. by clipping them to a desired interval.

  • The measure-valued gradient should be used with some coupling method for variance reduction. Coupling strategies that exploit relationships between the positive and negative components of the density decomposition, and which have shared sampling paths, are known for the commonly-used distributions.

  • If we have several unbiased gradient estimators, a convex combination of them might have lower variance than any of the individual estimators.

  • If the measure is discrete on its domain then the score-function or measure-valued gradient are available. The choice will again depend on the dimensionality of the parameter space.

  • In all cases, we strongly recommend having a broad set of tests to verify the unbiasedness of the gradient estimator when implemented.

Reference

(https://pages.stat.wisc.edu/~shao/stat609/stat609-07.pdf)

https://bochang.me/blog/posts/measure-val-grad/

signed measure

  • Copyright: Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.
  • Copyrights © 2021-2026 Xue Yu
  • Visitors: | Views: