Generative Models

Generative Models

Autoregressive Models, GAN, Flow-based Models, VAE

GAN: refer to From GAN to WGAN

VAE: refer to From Autoencoder to Beta-VAE

Flow-based Models: refer to Flow-based Deep Generative Models

Autoregressive Models: refer to Autoregressive Models

Reference: CS236 - Fall 2023 Deep Generative Models

Reference: EECE 571F (2023 Winter Term 1): Deep Learning with Structures

Energy-based Models (EBMs)

Parameterizing probability distributions

In generating models, we want to learn a probability distribution pθ(x)p_{\theta}(x), which closely matches the true data distribution pdata(x)p_{data}(x). The probability should satisfy the following two conditions:

  • non-negative: pθ(x)0p_{\theta}(x) \geq 0.

  • sum to one: pθ(x)dx=1\int p_{\theta}(x)dx = 1 or xp(x)=1\sum_{x}p(x) = 1.

It’s not hard to choose a non-negative function, for example, given any function fθ(x)f_{\theta}(x), we can choose gθ(x)=fθ(x)2,gθ=exp(fθ(x)),gθ(x)=fθ(x)g_{\theta}(x) = f_{\theta}(x)^2, g_{\theta} = \exp(f_{\theta}(x)), g_{\theta}(x) = |f_{\theta}(x)|, etc. However, gθ(x)g_{\theta}(x) might not sum to one. The solution is to normalize gθ(x)g_{\theta}(x) by dividing the sum of gθ(x)g_{\theta}(x) over all possible xx.

pθ(x)=gθ(x)xgθ(x)=gθ(x)gθ(x)dx=gθ(x)Z(θ),p_{\theta}(x) = \frac{g_{\theta}(x)}{\sum_{x}g_{\theta}(x)} = \frac{g_{\theta}(x)}{\int g_{\theta}(x) \text{d}x} = \frac{g_{\theta}(x)}{Z(\theta)},

where Z(θ)Z(\theta) is called the Partition function / Normalization constant.

Example:

  • Gaussian: g(μ,σ)(x)=e(xμ)22σ2g_{(\mu, \sigma)}(x) = e^{-\frac{(x-\mu)^2}{2 \sigma^2}}, volume is Z(μ,σ)=e(xμ)22σ2dx=2πσ2Z(\mu, \sigma) = \int e^{-\frac{(x-\mu)^2}{2 \sigma^2}} \text{d}x = \sqrt{2 \pi \sigma^2}.

  • Exponential: gλ(x)=eλxg_{\lambda}(x) = e^{-\lambda x}, volume is Z(λ)=0eλxdx=1/λZ(\lambda) = \int_0^{\infty} e^{-\lambda x} \text{d}x = 1/\lambda.

  • Exponential family: gθ(x)=h(x)eθTT(x)g_{\theta}(x) = h(x) e^{\theta^T T(x)}, volume is Z(θ)=h(x)eθTT(x)dxZ(\theta) = \int h(x) e^{\theta^T T(x)} \text{d}x.

  • Beta, Poisson, Gamma, Dirichlet, etc.

Generally, we can choose gθ(x)g_{\theta}(x) so that Z(θ)Z(\theta) is analytically. But how about using the models that Z(θ)Z(\theta) is not easy to compute analytically?

Energy-based Models

EBMs has the following form:

pθ(x)=1exp(fθ(x))dxefθ(x)=1Z(θ)efθ(x).p_{\theta}(x) = \frac{1}{\int \exp(f_{\theta}(x)) \text{d}x}e^{f_{\theta}(x)} = \frac{1}{Z(\theta)} e^{f_{\theta}(x)}.

Why do we choose fθ(x)f_{\theta}(x) as the form of efθ(x)e^{f_{\theta}(x)}?

  • We want to capture large variations in probability. We usually to use log-probability.
  • Exponential families. Many distributions can be written in this form.
  • Some physical meaning. fθ(x)- f_{\theta}(x) is called the energy.

Pros:

  • We can use any function fθ(x)f_{\theta}(x) to parameterize the probability distribution.
  • Stable training.
  • Relatively high sample quality.

Cons:

  • Sampling from pθ(x)p_{\theta}(x) is hard.
  • Evaluating and optimizing likelihood is pθ(x)p_{\theta}(x) is hard.
  • Curse of dimensionality. Computing Z(θ)Z(\theta) numerically scales exponentially with the dimensionality of xx.

EBMs with Discrete Observable Variables and Discrete Latent Variables: Restricted Boltzmann machine(RBM)

Suppose we have binary visible units xx, binary hidden units(latent variables) hh, the energy function is:

E(x,h)=aTxbThxTWhE(x, h) = - a^T x - b^T h - x^T W h

where a,b,Wa, b, W are parameters.

The probability distribution is:

p(x,h)=1ZeE(x,h)=1ZeaTx+bTh+xTWhp(x, h) = \frac{1}{Z} e^{-E(x, h)} = \frac{1}{Z} e^{a^T x + b^T h + x^T W h}

where Z=x,heE(x,h)Z = \sum_{x, h} e^{-E(x, h)}.


Why restricted?

  • Only one layer of hidden units.
  • No connections between hidden units.

Bipartite graph: conditional independence

p(xh)=i=1Dp(xih)p(hx)=j=1Hp(hjx)\begin{aligned} p(x|h) &= \prod_{i=1}^D p(x_i|h) \\ p(h|x) &= \prod_{j=1}^H p(h_j|x) \end{aligned}

Formally, we have

p(xh=h~)exp(Eθ(x,h=h~))exp(a~Tx)=iexp(a~ixi)\begin{aligned} p(x|h = \tilde{h}) \propto \exp(- E_{\theta}(x, h = \tilde{h})) \propto \exp(-\tilde{a}^Tx) = \prod_{i} \exp(-\tilde{a}_i x_i) \end{aligned}

Inference: Gibbs Sampling

In inference, we want to compute the maximum a posterior(MAP) p(hx)p(h|x) and computing the marginals p(x)p(x).

  • Due to the conditional independence, we can compute p(hx)p(h|x) in parallel.
  • But, the marginal p(x)p(x) is intractable. We need to use Markov Chain Monte Carlo(MCMC) to sample from p(x)p(x).

Gibbs sampling is a special case of MCMC. It can draw samples from p(x1,x2,...,xn)p(x_1, x_2,...,x_n) by iteratively sampling from the conditional distributions p(xix1,x2,...,xi1,xi+1,...,xn)p(x_i|x_1, x_2,...,x_{i-1}, x_{i+1},...,x_n).

In RBM, we do not iterative over individual variables. Instead, we do block-Gibbs sampling, i.e., sampling a block of variables conditioned on the other block.

Given initial sample (x(0),h(0))(x^{(0)}, h^{(0)}),

for t=1,2,...,Tt = 1, 2, ..., T:

h(t)p(hx=x(t1)),x(t)p(xh=h(t)),\begin{aligned} h^{(t)} &\sim p(h|x = x^{(t-1)}), \\ x^{(t)} &\sim p(x|h = h^{(t)}), \end{aligned}

Return (x(T),h(T))(x^{(T)}, h^{(T)}).

For both p(hx)p(h|x) and p(xh)p(x|h), we can sampling in parallel.

Remark: the Gibbs sampler can generate random variables from a (marginal) distribution indirectly. After sampling many iterations, (x(T),h(T))(x^{(T)}, h^{(T)}) follows the distribution p(x,h)p(x, h), x(T)x^{(T)} follows the marginal distribution p(x)p(x), and h(T)h^{(T)} follows the marginal distribution p(h)p(h). For more detials:

Explaining the Gibbs sampler

Markov Chain Monte Carlo.Gibbs Sampler.

Learning: Contrastive Divergence

In RBMs, we want to learn the parameters θ\theta by maximizing the summed log-likelihood of the training data logpθ(x)\log p_{\theta}(x). The problem is that the partition function Z(θ)Z(\theta) is intractable. Contrastive divergence(CD) is a method to approximate the gradient of the log-likelihood.

Since

pθ(x)θ=1pθ(x)pθ(x)θ=1pθ(x)pθ(x,h)dhθ=1pθ(x)pθ(x,h)θdh=1pθ(x)1Zexp(Eθ(x,h))θdh=1pθ(x)(1Z2exp(Eθ(x,h))Zθ1Zexp(Eθ(x,h))Eθ(x,h)θ)dh=1pθ(x)1ZZθpθ(x,h)dh1pθ(x)Eθ(x,h)θpθ(x,h)dh=1ZZθpθ(hx)dhEθ(x,h)θpθ(hx)dh=1ZZθEθ(x,h)θpθ(hx)dh=1Zexp(Eθ(x,h))dxdhθEpθ(hx)[Eθ(x,h)θ]=(Eθ(x,h)θ)pθ(x,h)dxdhEpθ(hx)[Eθ(x,h)θ]=Epθ(hx)[Eθ(x,h)θ]Epθ(x,h)[Eθ(x,h)θ]\begin{aligned} \frac{\partial p_{\theta}(x)}{\partial \theta} &= \frac{1}{p_{\theta}(x)} \frac{\partial p_{\theta}(x)}{\partial \theta} \\ &= \frac{1}{p_{\theta}(x)} \frac{\partial \int p_{\theta}(x, h)\text{d}h}{\partial \theta} \\ &= \frac{1}{p_{\theta}(x)} \int \frac{\partial p_{\theta}(x, h)}{\partial \theta}\text{d}h \\ &= \frac{1}{p_{\theta}(x)} \int \frac{\frac{1}{Z} \exp(-E_{\theta}(x, h))}{\partial \theta} \text{d}h \\ &= \frac{1}{p_{\theta}(x)} \int (-\frac{1}{Z^2} \exp(-E_{\theta}(x, h)) \frac{\partial Z}{\partial \theta} - \frac{1}{Z} \exp(-E_{\theta}(x, h)) \frac{\partial E_{\theta}(x, h)}{\partial \theta}) \text{d}h \\ &= -\frac{1}{p_{\theta}(x)} \int \frac{1}{Z} \frac{\partial Z}{\partial \theta} p_{\theta}(x, h) \text{d}h - \frac{1}{p_{\theta}(x)} \int \frac{\partial E_{\theta}(x, h)}{\partial \theta} p_{\theta}(x, h) \text{d}h \\ &= - \int \frac{1}{Z} \frac{\partial Z}{\partial \theta} p_{\theta}(h | x) \text{d}h - \int \frac{\partial E_{\theta}(x, h)}{\partial \theta} p_{\theta}(h | x) \text{d}h \\ &= - \frac{1}{Z} \frac{\partial Z}{\partial \theta} - \int \frac{\partial E_{\theta}(x, h)}{\partial \theta} p_{\theta}(h | x) \text{d}h \\ &= - \frac{1}{Z} \frac{\partial \int \int \exp(-E_{\theta}(x, h)) \text{d}x \text{d}h}{\partial \theta} - \mathbb{E}_{p_{\theta}(h|x)}[\frac{\partial E_{\theta}(x, h)}{\partial \theta}] \\ &= -\int \int (-\frac{\partial E_{\theta}(x, h)}{\partial \theta}) p_{\theta}(x,h) \text{d}x \text{d}h - \mathbb{E}_{p_{\theta}(h|x)}[\frac{\partial E_{\theta}(x, h)}{\partial \theta}] \\ &= \mathbb{E}_{p_{\theta}(h|x)}[-\frac{\partial E_{\theta}(x, h)}{\partial \theta}] - \mathbb{E}_{p_{\theta}(x, h)}[-\frac{\partial E_{\theta}(x, h)}{\partial \theta}] \\ \end{aligned}

Here we don’t know the distribution pθ(hx)p_{\theta}(h|x). Maximizing the summed log-likelihood of the training data logpθ(x)\log p_{\theta}(x) is equivalent to minimizing the KL divergence between the real data distribution pdata(x)p_{data}(x) and the model distribution pθ(x)p_{\theta}(x):

minθKL(pdata(x)pθ(x))=minθpdata(x)logpdata(x)dxpdata(x)logpθ(x)dx.\begin{aligned} \min_{\theta} \text{KL}(p_{data}(x) || p_{\theta}(x)) = \min_{\theta} \int p_{data}(x) \log p_{data}(x) \text{d}x - \int p_{data}(x) \log p_{\theta}(x) \text{d}x. \end{aligned}

Since the entropy of pdata(x)p_{data}(x) is:

H(pdata(x))=pdata(x)logpdata(x)dx\begin{aligned} H(p_{data}(x)) &= - \int p_{data}(x) \log p_{data}(x) \text{d}x \end{aligned}

The cross-entropy of pdata(x)p_{data}(x) and pθ(x)p_{\theta}(x) is:

H(pdata(x),pθ(x))=pdata(x)logpθ(x)dx\begin{aligned} H(p_{data}(x), p_{\theta}(x)) &= -\int p_{data}(x) \log p_{\theta}(x) \text{d}x \\ \end{aligned}

And H(pdata(x),pθ(x))=H(pdata(x))+KL(pdata(x)pθ(x))H(p_{data}(x), p_{\theta}(x)) = H(p_{data}(x)) + \text{KL}(p_{data}(x) || p_{\theta}(x)).

The entropy of pdata(x)p_{data}(x) is a constant, so minimizing the KL divergence is equivalent to minimizing the cross-entropy of pdata(x)p_{data}(x) and pθ(x)p_{\theta}(x), which is equivalent to maximizing:

maxθpdata(x)logpθ(x)dx.\begin{aligned} \max_{\theta} \int p_{data}(x) \log p_{\theta}(x) \text{d}x. \end{aligned}

We can use stochastic gradient ascent to maximize the above equation. The gradient is:

θpdata(x)logpθ(x)dx=pdata(x)θlogpθ(x)dx=Epθ(hx)pdata(x)[Eθ(x,h)θ]Epθ(x,h)[Eθ(x,h)θ].\begin{aligned} \frac{\partial}{\partial \theta} \int p_{data}(x) \log p_{\theta}(x) \text{d}x &= \int p_{data}(x) \frac{\partial}{\partial \theta} \log p_{\theta}(x) \text{d}x \\ &= \mathbb{E}_{p_{\theta}(h|x)p_{data}(x)}[-\frac{\partial E_{\theta}(x, h)}{\partial \theta}] - \mathbb{E}_{p_{\theta}(x, h)}[-\frac{\partial E_{\theta}(x, h)}{\partial \theta}]. \end{aligned}

We can use Monte Carlo to approximate the above equation.

  • For the first expectation Epθ(hx)pdata(x)[Eθ(x,h)θ]\mathbb{E}_{p_{\theta}(h|x)p_{data}(x)}[-\frac{\partial E_{\theta}(x, h)}{\partial \theta}], we can first sample xx from pdata(x)p_{data}(x) (we don’t know the distribution of real data, but we have training data.), then sample hh from pθ(hx)p_{\theta}(h|x).
  • For the second expectation Epθ(x,h)[Eθ(x,h)θ]\mathbb{E}_{p_{\theta}(x, h)}[-\frac{\partial E_{\theta}(x, h)}{\partial \theta}], we can use finite-step Gibbs sampler.

In this way, we don’t need to compute the partition function Z(θ)Z(\theta). This method is called Contrastive Divergence(CD).

EBMs with Continuous Observable Variables and Discrete Latent Variables: GRBMs

Here, we consider continuous observable variables vv and binary units (latent variables) hh. The energy function is:

Eθ(v,h)=12(vμσ)T(vμσ)(vσ2)WhbTh.E_{\theta}(v, h) = \frac{1}{2} (\frac{v-\mu}{\sigma})^T(\frac{v-\mu}{\sigma}) - (\frac{v}{\sigma^2}) W h - b^T h.

The conditional independence still holds:

p(vh)=N(vWh+μ,diag(σ2))p(hj=1v)=[Sigmoid(WTvσ2+b)]j\begin{aligned} p(v|h) &= \mathcal{N}(v | Wh + \mu, \text{diag}(\sigma^2)) \\ p(h_j = 1|v) &= [\text{Sigmoid}(W^T \frac{v}{\sigma^2} + b)]_j \end{aligned}

Modern EBMs

EBMs with Learnable Energy Functions

For RBMs, we designed the energy function in advance, and it implied conditional independence. But, in general, it’s hard to design the energy function in advance.Thus, we want to learn the energy function Eθ(x)E_{\theta}(x) from data.

One way is to use deep neural networks to parameterize the energy function Eθ(x)E_{\theta}(x). For example, we can use U-Net architecture:

The energy obtained by the energy function is a scalar and the output of the U-Net is a tensor. Thus, we need to design some readout choices to get the scalar energy. For example,

Eθ(x)=xTfθ(x),Eθ(x)=(xfθ(x))2,Eθ(x)=fθ(x)2.\begin{aligned} E_{\theta}(x) &= x^T f_{\theta}(x), \\ E_{\theta}(x) &= (x - f_{\theta}(x))^2, \\ E_{\theta}(x) &= f_{\theta}(x)^2 .\\ \end{aligned}

Empirically, the first choice is better.

Inference: Langevin Monte Carlo

After learning the energy function Eθ(x)E_{\theta}(x), how to sample from

pθ(x)=1Zexp(Eθ(x))p_{\theta}(x) = \frac{1}{Z}\exp(-E_{\theta}(x))

One way is to use Langevin Monte Carlo(LMC). The stochastic differential equation(SDE) of LMC is:

dx=logpθ(x)dt+2dBt\begin{aligned} \text{d}x = \nabla \log p_{\theta} (x) \text{d}t + \sqrt{2} \text{d} B_t \end{aligned}

where BtB_t is a standard Brownian motion. The first term logpθ(x)dt\nabla \log p_{\theta} (x) \text{d}t is called the drift term, which dominates the movement of the particle. The second term 2dBt\sqrt{2} \text{d} B_t is called the diffusion term, which includes the stochasticity of the process.

One can prove Langevin diffusion is irreducible, strong Feller, and aperiodic. Thus, the stationary distribution of the Langevin diffusion is pθ(x)p_{\theta}(x), and we can use Langevin diffusion to sample from pθ(x)p_{\theta}(x).

To turn the Langevin diffusion into a sampling algorithm, we need to discretize the SDE. The simplest way is to use Euler-Maruyama discretization:

dx=logpθ(x)dt+2dBtxt+η=xt+logpθ(xt)(t+ηt)+2(Bt+ηBt)=xt+ηlogpθ(xt)+2ηϵ,ϵN(0,I)\begin{aligned} \text{d}x &= \nabla \log p_{\theta} (x) \text{d}t + \sqrt{2} \text{d} B_t \\ x_{t+\eta} &= x_t + \nabla \log p_{\theta} (x_t) (t+\eta - t) + \sqrt{2} (B_{t+\eta} - B_t) \\ &= x_t + \eta \nabla \log p_{\theta} (x_t) + \sqrt{2 \eta} \epsilon, \quad \epsilon \sim N(0, I) \end{aligned}

where η\eta is the step size.

  • If we ignore the noise term, we are using gradient ascent to maximize the density, this means we are trying to fing the ‘mode’ of xx. The ‘mode’ is the mean of the distribution. In order to generate more samples, we add noise term.

Sampling algorithm:

  • Given initial sample x(0)x^{(0)} and step size η\eta.
    • for t=1,2,...,Tt = 1, 2, ..., T:
      x(t)=x(t1)+ηlogpθ(x(t1))+2ηϵ,ϵN(0,I)x^{(t)} = x^{(t-1)} + \eta \nabla \log p_{\theta} (x^{(t-1)}) + \sqrt{2 \eta} \epsilon, \quad \epsilon \sim N(0, I)
  • Return x(T)x^{(T)}.

In EBMs, the score function logpθ(x)\nabla \log p_{\theta} (x) is the derivative of the energy function logpθ(x)\log p_{\theta} (x) with respect to xx, and

xlogpθ(x)=x(Eθ(x)logZ)=Eθ(x).\nabla_x \log p_{\theta} (x) = \nabla_x (-E_{\theta}(x) - \log Z)= -\nabla E_{\theta}(x).

here, ZZ doesn’t depend on xx.

Noticed that there is another score function logpθ(x)\nabla \log p_{\theta} (x), which is the derivative of the probability distribution logpθ(x)\log p_{\theta} (x) with respect to θ\theta.

Learning: Contrastive Divergence

Similar to RBMs, we can use contrastive divergence to update θ\theta. The gradient is

pdatalogpθ(x)θdx=Epdata(x)[Eθ(x)θ]Epθ(x)[Eθ(x)θ].\int p_{data} \frac{\partial \log p_{\theta}(x)}{\partial \theta} \text{d} x = \mathbb{E}_{p_{data}(x)}[- \frac{\partial E_{\theta}(x)}{\partial \theta}] - \mathbb{E}_{p_{\theta}(x)}[-\frac{\partial E_{\theta}(x)}{\partial \theta}].

For the second expectation, we can use Langevin Monte Carlo sampling to sample from pθ(x)p_{\theta}(x), and then estimate the expectation.

Score Matching

In contrastive divergence, at each training iteration, we use Langevin Monte Carlo to sample from pθ(x)p_{\theta}(x), and then estimate the expectation. However, the Langevin Monte Carlo sampling is not efficient, expecially in high-dimensional space. Thus, we need to train the model without sampling.

Score matching is a method to train the model without sampling. The idea is to minimize the difference between the score function logpθ(x)\nabla \log p_{\theta} (x) and the score function of the data distribution logpdata(x)\nabla \log p_{data} (x).

The (stein) score function is:

sθ(x)=logpθ(x)=Eθ(x),s_{\theta}(x) = \nabla \log p_{\theta} (x) = -\nabla E_{\theta}(x),

which is independent of the partition function Z(θ)Z(\theta) and needs the pdf is differentiable.

Fisher divergence between two distributions p(x)p(x) and q(x)q(x) is:

DF(p(x),q(x))=12Exp(x)[xlogp(x)xlogq(x)22].D_F(p(x), q(x)) = \frac{1}{2} \mathbb{E}_{x \sim p(x)}[||\nabla_x \log p(x) - \nabla_x \log q(x)||_2^2].

Score matching is to minimize the Fisher divergence between pθ(x)p_{\theta}(x) and pdata(x)p_{data}(x):

minθDF(pθ(x),pdata(x))=minθ12Expdata(x)[xlogpdata(x)sθ(x)22]=minθ12Expdata(x)[xlogpdata(x)(xEθ(x))22]\begin{aligned} \min_{\theta} D_F(p_{\theta}(x), p_{data}(x)) &= \min_{\theta} \frac{1}{2} \mathbb{E}_{x \sim p_{data}(x)}[||\nabla_x \log p_{data}(x) - s_{\theta}(x)||_2^2] \\ &= \min_{\theta} \frac{1}{2} \mathbb{E}_{x \sim p_{data}(x)}[||\nabla_x \log p_{data}(x) - (-\nabla_x E_{\theta}(x) )||_2^2] \end{aligned}

Since we don’t know the real data distribution, we need to deal with xlogpdata(x)\nabla_x \log p_{data}(x). Assume that $ p_{data}(x)$ decays to 0 sufficiently rapidly as x±x \rightarrow \pm \infty, one can derive the following equation:

12Expdata(x)[xlogpdata(x)xlogpθ(x)22]=Expdata(x)[12xlogpθ(x)22+tr(x2logpθ(x))]+const,\begin{aligned} &\frac{1}{2} \mathbb{E}_{x \sim p_{data}(x)}[||\nabla_x \log p_{data}(x) - \nabla_x \log p_{\theta}(x)||_2^2] \\ =&\mathbb{E}_{x \sim p_{data}(x)}[\frac{1}{2} ||\nabla_x \log p_{\theta}(x)||^2_2 + \text{tr}(\nabla^2_x \log p_{\theta}(x))] + \text{const}, \end{aligned}

where tr(x2logpθ(x))tr(\nabla^2_x \log p_{\theta}(x)) is the trace of the Hessian matrix of logpθ(x)\log p_{\theta}(x). Therefore, we can use monte carlo to estimate the above loss:

Expdata(x)[12xlogpθ(x)22+tr(x2logpθ(x))],=1ni=1n[12xlogpθ(xi)22+tr(x2logpθ(xi))]=1ni=1n[12Eθ(xi)22+tr(x2logpθ(xi))]\begin{aligned} &\mathbb{E}_{x \sim p_{data}(x)}[\frac{1}{2} ||\nabla_x \log p_{\theta}(x)||^2_2 + \text{tr}(\nabla^2_x \log p_{\theta}(x))], \\ =& \frac{1}{n} \sum_{i=1}^n [\frac{1}{2} ||\nabla_x \log p_{\theta}(x_i)||^2_2 + \text{tr}(\nabla^2_x \log p_{\theta}(x_i))] \\ =& \frac{1}{n} \sum_{i=1}^n [\frac{1}{2} ||\nabla E_{\theta}(x_i)||^2_2 + \text{tr}(\nabla^2_x \log p_{\theta}(x_i))] \\ \end{aligned}

Then, we can use stochastic gradient descent to minimize the above loss.

Note: computing the trace of the Hessian matrix tr(x2logpθ(x))\text{tr}(\nabla^2_x \log p_{\theta}(x)) is expensive.

Conclusions:

  • we have used two distances for training EBMs:
    • KL divergence, which is equal to maximum likelihood. (contrastive divergence).
    • Fisher divergence, which is equal to score matching.
  • Energy-based models are very felxible probabilistic models with intracable partition functions.
  • Sampling is hard and requires MCMC.
  • Computing the likelihood is hard.
  • Comparing the likelihood/probability of two different points is tractable.
  • Contrastive divergence is a good approximation to maximum likelihood. But, it needs sampling for each iteration.
  • Sampling free methods: score matching, noise contrastive estimation, adversarial optimization, etc.

Score-Based Models

How to represent probability distribution function p(x)p(x) in different models:

  • GAN: min-max loss

  • Autoregressive models: pθ(x)=i=1dpθ(xix<i)p_{\theta}(x) = \prod_{i=1}^{d} p_{\theta}(x_i | x_{<i})

  • Flow-based models: pθ(x)=p(z)det(Jfθ(x))p_{\theta}(x) = p(z) |\det(J_{f_{\theta}}(x))|, z=fθ(x)z = f_{\theta}(x).

  • VAE: use ELBO obj and latent variables

  • EBMs: pθ(x)=1Zexp(Eθ(x))p_{\theta}(x) = \frac{1}{Z}\exp(-E_{\theta}(x))

Pros: except for GAN, these models are maximizing the likelihood.

Cons: They need special atchitectures or surrogate losses.

Remember that the score function is:

sθ(x)=logpθ(x).s_{\theta}(x) = \nabla \log p_{\theta} (x).

As shown in the following figure, score function is the gradient of the log probability function logpθ(x)\log p_{\theta}(x) and the direction of the score function is the vector to the mode of the distribution. This means that the score function directly models the vector field of gradients.

Score matching is not limited to EBMs. We can use score matching to train other models, such as autoregressive models, flow-based models, etc.

We want to train a score-based model sθs_{\theta} to estimate the score xlogpdata(x)\nabla_{x} \log p_{data}(x), we use the average Euclidean distance between the score function sθ(x)s_{\theta}(x) and the score xlogpdata(x)\nabla_{x} \log p_{data}(x) over the whole space as the loss function:

12Expdata(x)[sθ(x)xlogpdata(x)22],(Fisher divergence)\begin{aligned} \frac{1}{2} \mathbb{E}_{x \sim p_{data}(x)}[||s_{\theta}(x) - \nabla_{x} \log p_{data}(x)||_2^2], (\text{Fisher divergence}) \end{aligned}

which is equal to minimize:

Expdata(x)[12sθ(x)22+tr(xsθ(x))],(Score matching)\begin{aligned} \mathbb{E}_{x \sim p_{data}(x)}[\frac{1}{2} ||s_{\theta}(x) ||^2_2 + \text{tr}(\nabla_x s_{\theta}(x))], (\text{Score matching}) \end{aligned}

We need to compute the value of the score function sθ(x)s_{\theta}(x) and the trace of the Jacobian matrix tr(xsθ(x))\text{tr}(\nabla_x s_{\theta}(x)). Thus, the score model must be efficient to evaluate. Since the score models is not scalable, computing the trace of the Jacobian matrix tr(xsθ(x))\text{tr}(\nabla_x s_{\theta}(x)) in the backpropagation is order of O(d)O(d), where dd is the dimension of xx. We need to find an efficient way to train the score model.

Denoising Score Matching

Consider the perturbed distribution:

qσ(x~x)=N(x~x,σ2I),qσ(x~)=p(x)qσ(x~x)dx.q_{\sigma}(\tilde{x}|x) = \mathcal{N}(\tilde{x}|x, \sigma^2 I), \qquad q_{\sigma}(\tilde{x}) = \int p(x)q_{\sigma}(\tilde{x}|x) \text{d}x.

Instead of estimating xlogqθ(x)\nabla_x \log q_{\theta}(x), we can estimate x~logqσ(x~)\nabla_{\tilde{x}} \log q_{\sigma}(\tilde{x}). It’s easier to estimate and when the noise level is small, qσ(x~)p(x~)q_{\sigma}(\tilde{x}) \approx p(\tilde{x}).

Therefore, we can use denoising score matching to match the score of a noise-perturbed distribution:

12Ex~qσ[x~logqσ(x~)sθ(x~)22]=12Expdata(x),x~qσ(x~x)[sθ(x~)x~logqσ(x~x)22]+const.\begin{aligned} &\frac{1}{2}E_{\tilde{x} \sim q_{\sigma}}[\|\nabla_{\tilde{x}}\log q_{\sigma}(\tilde{x}) - s_{\theta}(\tilde{x})\|_2^2] \\ =& \frac{1}{2} E_{x \sim p_{data}(x), \tilde{x} \sim q_{\sigma}(\tilde{x}|x)}[\|s_{\theta}(\tilde{x})- \nabla_{\tilde{x}} \log q_{\sigma}(\tilde{x}|x)\|_2^2] + \text{const}. \end{aligned}

In this form, we don’t need to compute the trace of the Jacobian matrix tr(xsθ(x))\text{tr}(\nabla_x s_{\theta}(x)). Since qσ(x~x)=N(x~x,σ2I)q_{\sigma}(\tilde{x}|x) = \mathcal{N}(\tilde{x}|x, \sigma^2 I), x~logqσ(x~x)=x~xσ2\nabla_{\tilde{x}} \log q_{\sigma}(\tilde{x}|x) = -\frac{\tilde{x} - x}{\sigma^2}. It’s more efficient to optimize for high dimensional data.

Con: notice that, we use score function to estimate the noise-perturbed distribution, which means we cannot estimate the score of the clean data.

Denoising

Denoising: after training a score model, we can use langevin MC sampling to get noise samples from qσ(x~)q_{\sigma}(\tilde{x}). According to Tweedie’s formula:

Exp(xx~)[x]=x~+σ2xlogqσ(x~)x~+σ2sθ(x~).E_{x \sim p(x|\tilde{x})}[x] = \tilde{x} + \sigma^2 \nabla_x \log q_{\sigma}(\tilde{x}) \approx \tilde{x} + \sigma^2 s_{\theta}(\tilde{x}).

[remember sθ(x~)=x~logqσ(x~)s_{\theta}(\tilde{x}) = \nabla_{\tilde{x}}\log q_{\sigma}(\tilde{x})]

Langevin MCMC: from scores to samples:

  • Given initial sample x(0)x^{(0)}.
  • for t=1,2,...,Tt = 1, 2, ..., T:
  • z(t)N(0,I)\qquad z^{(t)} \sim \mathcal{N}(0, I)
  • x~(t)=x~(t1)+ϵ2sθ(x~)+ϵzt\qquad\tilde{x}^{(t)} = \tilde{x}^{(t-1)} + \frac{\epsilon}{2} \nabla s_{\theta}(\tilde{x}) + \sqrt{\epsilon} z^t
  • Return x~(T)\tilde{x}^{(T)}.

If noise ϵ0\epsilon \to 0 and TT \to \infty, x~(T)pdata(x)\tilde{x}^{(T)} \sim p_{data}(x).

Multi-scale Noise Perturbation

When using the denoising score matching, we only use the observed samples to estimate the scores. Thus, the estimated scores in low density regions are not accurate. Moreover, langevin MCMC converges very slowly.

One way to improve the accuracy of the estimated scores in low density regions is to increase the noise level σ\sigma. As shown in the following figure, the estimated scores in low density regions are more accurate when the noise level σ\sigma is large.

  • High noise provides useful directional information for Langevin dynamics.
  • But perturbed density no longer approximates the true data density.

Multi-scale noise perturbation: perturb data with different levels of noise simulteanously, and aggregate the information from all noise levels.

If the noise levle σ\sigma is large, the perturbed data quality is worse, but the estimated score is more close to the perturbed scores. It’s a trade-off between the data quality and estimated score accuracy.

When the noise is small, the perturbed distribution is close to the original data distribution, but the estimation errors in low density regions are still high.

When we add larger and larger noise, the estimation score is close to the perturbed data score, but the perturbed data score differs from the original data score.

In order to achieve the best data quality and estimation accuracy at the same time, we should consider all perturbations jointly instead of focusing on only one perturbation.

Training procedure:
Assume we have LL noise levels σ1,σ2,...,σL\sigma_1, \sigma_2, ..., \sigma_L and corresponding perturbed data distributions qσ1(x~),qσ2(x~),...,qσL(x~)q_{\sigma_1}(\tilde{x}), q_{\sigma_2}(\tilde{x}), ..., q_{\sigma_L}(\tilde{x}). For each perturbed data distribution, we can easily sample from them, and use score estimation to estimate the corresponding scores. However, this method requires a large number of separate score models to be learned independently, which is very costly, exspecially when the number of noise levels LL is large.

One way is to train a single conditional score network for all noise levels. The score model will take σ\sigma as an input. This model is named the Noise Conditional Score Network.

The loss function is a weighted combination of denoising score matching loss with different noise levels:

1Ll=1Lλ(σi)Ex~qσi[x~logqσi(x~)sθ(x~,σi)22]=1Ll=1Lλ(σi)Expdata(x),zN(0,I)[sθ(x+σiz,σi)+zσi22]+const.\begin{aligned} &\frac{1}{L}\sum_{l=1}^L \lambda(\sigma_i) E_{\tilde{x} \sim q_{\sigma_i}}[\|\nabla_{\tilde{x}}\log q_{\sigma_i}(\tilde{x}) - s_{\theta}(\tilde{x}, \sigma_i)\|_2^2] \\ =& \frac{1}{L}\sum_{l=1}^L \lambda(\sigma_i) E_{x \sim p_{data}(x), z \sim N(0, I)}[\|s_{\theta}(x+\sigma_i z, \sigma_i) + \frac{z}{\sigma_i}\|_2^2] + \text{const}. \end{aligned}

[compute the loss in parallel?]

About the weighting function λ(σi)\lambda(\sigma_i), we can set λ(σi)=σi2\lambda(\sigma_i) = \sigma_i^2.

About choosing the noise level σi\sigma_i:

  • The largest noise level σ1\sigma_1 approximates the maximum pairwise distance between data points.

  • The smallest noise level σL\sigma_L should be small enough so that the noise in final samples is negligible.

  • Adjacent noise scales should have sufficient overlap to facilitate transitioning across noise scales in annealed Langevin dynamics. One way is to use geometric sequence:

    σ1σ2=σ2σ3=...=σL1σL,σ1>σ2>...>σL.\frac{\sigma_1}{\sigma_2} = \frac{\sigma_2}{\sigma_3} = ... = \frac{\sigma_{L-1}}{\sigma_L}, \quad \sigma_1 > \sigma_2 > ... > \sigma_L.

Sampling procedure: we use annealed Langevin dynamics to sample from the noise conditional score network using scores of different noise levels.

We first use Langevin dynamics to sample from the most perturbed data distribution. Then, the resulting samples will be used as initial samples for sampling from the next noise level. We continue in this fashion and finally use Langevin dynamics to sample from the least perturbed data distribution.

sθs_{\theta} is shared at each iteration, since we only have one network.

Conclusions:

  • Gradients of distributions (scores) can be estimated easily
  • Flexible architecture choices — no need to be normalized/invertible
  • Stable training — no minimax optimization
  • Better or comparable sample quality to GANs
  • Exact likelihood computation

Introduction about diffusion model

Reference: What are Diffusion Models?

DDPM: Denoising Diffusion Probabilistic Models

Diffusion model is a generative model:

  • diffusion process: add noise to a real image, finally we get a noise image.
  • reverse process: from noise image to generate real image.

(1) training phase
from a real image datasets —> through diffusion process —> noise images —> through reverse process —> real images

(2) inference phase

sampling noise images from a gaussian distribution, then use the pre-trained reverse process to generate images.

Diffusion process

add noise to a clean image X0X_0 and then we get noisy image X1,X2,...,XTX_1, X_2, ..., X_T.

Now, we focus on the process from image Xt1X_{t-1} to XtX_t.

Xt=1βtXt1+βtZt,ZtN(0,I)X_t = \sqrt {1 - \beta_t}X_{t-1} + \sqrt {\beta_t} Z_{t}, \quad Z_t \sim N(0, I)

Remark: the noise scale βt\beta_t will be increased gradually. $ \beta_t$ increases from 10410^{-4} to 21022*10^{-2} linearly. T=2000T = 2000.

Let 1βt=αt1 - \beta_t = \alpha_t, then we have

Xt=αtXt1+1αtZtXt1=αt1Xt2+1αt1Zt1X_t = \sqrt{\alpha_t}X_{t-1} + \sqrt{1 - \alpha_t} Z_t \\ X_{t-1} = \sqrt{\alpha_{t-1}}X_{t-2} + \sqrt{1 - \alpha_{t-1}} Z_{t-1} \\

Combine these two equlities, we have

Xt=αtXt1+1αtZt=αt(αt1Xt2+1αt1Zt1)+1αtZt=αtαt1Xt2+αt(1αt1)Zt1+1αtZt=αtαt1Xt2++1αtαt1Z,ZN(0,I)=...=αtαt1...α1X0++1αtαt1...α1Z=αˉtX0++1αˉtZ,αˉt=i=1tαi.\begin{aligned} X_t &= \sqrt{\alpha_t}X_{t-1} + \sqrt{1 - \alpha_t} Z_t \\ &= \sqrt{\alpha_t}(\sqrt{\alpha_{t-1}}X_{t-2} + \sqrt{1 - \alpha_{t-1}} Z_{t-1})+ \sqrt{1 - \alpha_t} Z_t \\ &= \sqrt{\alpha_t \alpha_{t-1}}X_{t-2} + \sqrt{\alpha_t(1 - \alpha_{t-1})} Z_{t-1}+ \sqrt{1 - \alpha_t} Z_t \\ & = \sqrt{\alpha_t \alpha_{t-1}}X_{t-2} + + \sqrt{1 - \alpha_t \alpha_{t-1}} Z, \quad Z \sim N(0, I) \\ &= ... \\ &= \sqrt{\alpha_t \alpha_{t-1}...\alpha_1}X_{0} + + \sqrt{1 - \alpha_t \alpha_{t-1}... \alpha_{1}} Z \\ &= \sqrt{\bar{\alpha}_t}X_{0} + + \sqrt{1 - \bar{\alpha}_t} Z, \qquad \bar{\alpha}_t = \prod_{i = 1}^{t}\alpha_i. \end{aligned}

The relation between DDPM and SDE

DDPM: xi=1βixi1+βizi1x_i = \sqrt{1 - \beta_i} x_{i-1} + \sqrt{\beta_i}z_{i-1}

SDE: dx=f(x,t)dt+g(x,t)dw\text{d}x = f(x, t)\text{d}t + g(x, t)\text{d}w,

the solution is

xt=x0+0tf(x,t)dt+0tg(x,t)dwx_t = x_0 + \int_0^t f(x, t)dt + \int_0^t g(x, t)dw

where ww is a Brownian motion.

How to get the mean and convariance of xtx_t ?

We can use FPK equation. See more on FPK equation.

The problem is the mean and covariance of xtx_t is dependent on the expectation of p(x(t))p(x(t)), which we don’t know.

DDPM / Variance preserving SDE

xi=1βixi1+βizi1x_i = \sqrt{1 - \beta_i} x_{i-1} + \sqrt{\beta_i}z_{i-1}

Let β(t=i/N)=Nβi,X(t=i/N)=xi,Z(i/N)=zi,Δt=1/N\beta(t=i/N) = N \beta_i, X(t = i/N) = x_i, Z(i/N) = z_i, \Delta t = 1/N.

Then we have

X(t+Δt)=1β(t+Δt)ΔtX(t)+β(t+Δt)ΔtZ(t)X(t)12β(t)ΔtX(t)+β(t)ΔtZ(t)\begin{aligned} X(t + \Delta t) &= \sqrt{1 - \beta(t+\Delta t)\Delta t}X(t) + \sqrt{\beta(t+\Delta t)\Delta t}Z(t) \\ & \approx X(t) - \frac{1}{2} \beta(t) \Delta t X(t) + \sqrt{\beta(t)\Delta t}Z(t) \\ \end{aligned}

dx=12β(t)x(t)dt+β(t)dw,\text{d} x = - \frac{1}{2} \beta(t) x(t) dt + \sqrt{\beta(t)} \text{d} w,

where ww is a Brownian motion.

The mean and covariance of xtx_t is

E[x(t)]=E[x(0)]e120tβ(s)dscov[x(t)]=II×e120tβ(s)dsI.\begin{aligned} \mathbb{E}[x(t)] &= \mathbb{E}[x(0)]e^{-\frac{1}{2}\int_0^t \beta(s)ds} \\ \text{cov}[x(t)] &= I - I \times e^{-\frac{1}{2}\int_0^t \beta(s)ds} \leq I. \end{aligned}

SCORE-based model / Variance Exploding SDE

xi=xi1+σi2σi12zi1x_i = x_{i-1} + \sqrt{\sigma_{i}^2 - \sigma^2_{i-1}}z_{i-1}

Let X(t=i/N)=xiX(t = i/N) = x_i, σ(t=i/N)=σi,Z(t=1/N)=zi,Δt=1/N\sigma(t = i/N) = \sigma_i, Z(t = 1/N) = z_i, \Delta t = 1/N.

X(t+Δt)=X(t)+σ(t+Δt)2σ(t)2Z(t)X(t)+Δσ(t)2ΔtΔtZ(t)\begin{aligned} X(t + \Delta t) &= X(t) + \sqrt{\sigma(t+\Delta t)^2 - \sigma(t)^2}Z(t) \\ & \approx X(t) + \sqrt{\frac{\Delta \sigma(t)^2}{\Delta t} \Delta t}Z(t) \\ \end{aligned}

where ww is a Brownian motion.

dx=dσ(t)2dtdwd x = \sqrt{\frac{d \sigma(t)^2}{dt}} d w

The mean and covariance of xtx_t is

E[x(t)]=E[x(0)]cov[x(t)]=σ2(t)I.\begin{aligned} \mathbb{E}[x(t)] &= \mathbb{E}[x(0)] \\ \text{cov}[x(t)] &= \sigma^2(t) I. \end{aligned}

Here, σ2(t)\sigma^2(t) is non-decreasing variance. Thus, the variance will be exploded.

Evaluating Generative Models

Model families

  • Probability density/mass functions

    • Autoregressive models: pθ(x)=i=1dpθ(xix<i)p_{\theta}(x) = \prod_{i=1}^{d} p_{\theta}(x_i | x_{<i}).
    • Normalizing flow models: pθ(x)=p(z)det(Jfθ(x))p_{\theta}(x) = p(z) |\det(J_{f_{\theta}}(x))|, z=fθ(x)z = f_{\theta}(x).
    • Latent variable models(VAEs): pθ(x)=pθ(xz)p(z)dzp_{\theta}(x) = \int p_{\theta}(x|z)p(z) \text{d}z.
    • Energy-based models: pθ(x)=1Zexp(Eθ(x))p_{\theta}(x) = \frac{1}{Z}\exp(-E_{\theta}(x)).
  • Sample generation processes

    • GANs: x=Gθ(z)x = G_{\theta}(z), zp(z)z \sim p(z).
  • Score functions

    • Score-based models: sθ(x)=xlogpθ(x)s_{\theta}(x) = \nabla_x \log p_{\theta} (x).

Distances of probability distributions

  • KL divergence(maximum likelihood): DKL(pq)=p(x)logp(x)q(x)dxD_{KL}(p||q) = \int p(x) \log \frac{p(x)}{q(x)} \text{d}x

    • Autoregressive models.
    • Normalizing flow models.
    • ElBO in VAEs.
    • Contrastive divergence in EBMs.
  • f-divergences, Wasserstein distances

    • GANs (f-GANs, WGANs)
  • Fisher divergence(score matching): denoising score matching, sliced score matching

    • Score-based models
    • Energy-based models
  • Noise-contrastive estimation

    • Energy-based models

Evaluation - Density estimation

We can use likelihood as a matric for density estimation:

  • Split dataset into train, validation, test sets.
  • Learn model pθ(x)p_{\theta}(x) using the train set.
  • Tune hyperparameters using the validation set.
  • Evaluate likelihood on the test set: pdata[logpθ(x)]\mathbb{p_{data}}[\log p_{\theta}(x)].

However, the likelihood is intractable for many models. Not all models have tractable likelihoods. For example, GANs, VAEs, EBMs, etc.

For VAEs, we can compare the ELBO to log-likelihood. But, how about GANs and EBMs?

In general, unbiased estimation of probability density functions from samples is impossible. We can use approximation methods, such as kernel density estimation.

Kernel density estimation:

Given an intractable density model pθ(x)p_{\theta}(x) and limited samples S={xi}i=1nS = \{x_i\}_{i=1}^n, we can use kernel density estimation to estimate the density function pθ(x)p_{\theta}(x). For a new data point xx, we can estimate the density of xx as:

p^θ(x)=1niSK(xxiσ),\hat{p}_{\theta}(x) = \frac{1}{n} \sum_{i \in S} K(\frac{x - x_i}{\sigma}),

where KK is a kernel function, σ\sigma is the bandwidth.

  • Gaussian kernel: K(x)=12πexp(12x2)K(x) = \frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}x^2).
  • A kernel is a function that satisfies the following properties:
    • K(x)0K(x) \geq 0.
    • K(x)dx=1\int K(x) \text{d}x = 1.
    • K(x)=K(x)K(x) = K(-x).
  • Bandwidth σ\sigma controls the smoothness of the density estimate.
    • Small σ\sigma: undersmoothed
    • Large σ\sigma: oversmoothed
    • σ\sigma is a hyperparameter. We can use cross-validation to choose σ\sigma.
  • KDE is very unreliable in higher dimensions.

Importance sampling for latent variable models

For likelihood p(x)p(x), we can use likehood weighting to estimate the likelihood:

p(x)=Ep(z)[p(xz)].p(x)= \mathbb{E}_{p(z)}[p(x|z)].

Monte Carlo sampling is one way to estimate the expectation. However, if p(z)p(z) is far from p(zx)p(z|x), the variance of the likehood weighting is very large. For example, the probability of p(z)p(z) is very small, but the probability of p(zx)p(z|x) is very large at some regions. Thus, we need another distribution q(z)q(z), which is close to p(zx)p(z|x), to estimate the expectation.

Importance sampling is another way to estimate the expectation. The idea is to sample from a proposal distribution q(z)q(z), then

Ep(z)[p(xz)]=p(z)p(xz)dz=q(z)p(z)q(z)p(xz)dz=Eq(z)[p(z)q(z)p(xz)].\begin{aligned} \mathbb{E}_{p(z)}[p(x|z)] = \int p(z) p(x|z) \text{d}z = \int q(z) \frac{p(z)}{q(z)} p(x|z) \text{d}z = \mathbb{E}_{q(z)}[\frac{p(z)}{q(z)} p(x|z)]. \end{aligned}

Then, we can use Monte Carlo sampling to estimate the expectation.

Pros:

  • Still unbiased.
  • We can choose q(z)q(z) to have lower variance. One can prove that q(z)q(z) should be high where p(z)p(xz)|p(z)p(x|z)| is high.

Cons:

  • We need to choose a good proposal distribution q(z)q(z).
  • It’s unreliable in high dimensions.

When to use importance sampling?

  • p(z)p(z) is difficult to sample from.
  • We can evaluate p(z)p(z).
  • q(x)q(x) is easy to evaluate and sample from.
  • We can choose q(z)q(z) to be high where p(z)p(xz)|p(z)p(x|z)| is high.

Annealed importance sampling is another method to estimate the likelihood.

Sample quality

Inception score (IS)

Inception score is a metric for evaluating the quality of generated images. The idea is to use a pretrained Inception network to classify the generated images.

Assumption 1: we are evaluating sample quality for generative models trained on labeled datasets.

Assumption 2: We have a good probabilistic classifier c(yx)c(y|x) for predicting the label yy of any point xx.

(A classifier can be trained on a large dataset, such as ImageNet.)

We want a good generative model to satisfy two properties:

  • Sharpness: the generated images should be sharp.

    S=exp(Exp[c(yx)logc(yx)dy])S = \exp(E_{x \sim p}[\int c(y | x)\log c(y|x) \text{d}y])

    High sharpness implies classifier is confident in making predictions for generated images, and c(yx)c(y|x) has low entropy.

  • Diversity: the generated images should be diverse.

    D=exp(Exp[c(yx)logc(y)dy]),c(y)=Exp[c(yx)]D = \exp(E_{x \sim p}[\int c(y|x) \log c(y) \text{d} y]), \qquad c(y) = E_{x \sim p} [c(y|x)]

    High diversity implies the generated images are diverse, and c(y)c(y) has high entropy.

Inception score combines these two properties:

IS=D×S.IS = D \times S.

Higher IS implies better sample quality.

Frechet inception distance (FID)

Inception score only considers the samples from pθ(x)p_{\theta}(x), but ignores the real data distribution pdata(x)p_{data}(x).

FID is a metric for evaluating the quality of generated images. The idea is to use a pretrained Inception network to extract features from the generated images and real images. Then, we can compute the Frechet distance between the two feature distributions.

  • Let G\mathcal{G} be the generated samples and T\mathcal{T} be the test dataset.
  • Compute feature representations FGF_{\mathcal{G}} and FTF_{\mathcal{T}}.
  • Fit a multivariate Gaussian to FGF_{\mathcal{G}} and FTF_{\mathcal{T}}. Let μG\mu_{\mathcal{G}} and μT\mu_{\mathcal{T}} be the mean vectors and ΣG\Sigma_{\mathcal{G}} and ΣT\Sigma_{\mathcal{T}} be the covariance matrices.
  • FID is defined as the Wasserstein-2 distance between the two Gaussians:

    FID(G,T)=μGμT22+tr(ΣG+ΣT2(ΣGΣT)1/2).FID(\mathcal{G}, \mathcal{T}) = ||\mu_{\mathcal{G}} - \mu_{\mathcal{T}}||_2^2 + \text{tr}(\Sigma_{\mathcal{G}} + \Sigma_{\mathcal{T}} - 2(\Sigma_{\mathcal{G}}\Sigma_{\mathcal{T}})^{1/2}).

Lower FID implies better sample quality.

Kernel inception distance (KID)

Maximum mean discrepancy (MMD) is a two-sample test statistic that measures the distance between two distributions by computing differences in their moments. Using the kernel trick, we can compute the MMD between two distributions:

MMD(p,q)=Ex,xp[K(x,x)]+Ey,yq[K(y,y)]2Exp,yq[K(x,y)].MMD(p, q) = E_{x, x' \sim p} [K(x, x')] + E_{y, y' \sim q} [K(y, y')] - 2E_{x \sim p, y \sim q} [K(x, y)].

Kernel inception distance (KID) is a metric for evaluating the quality of generated images. The idea is to use a pretrained Inception network to extract features from the generated images and real images. Then, we can compute the MMD between the two feature distributions.

FID VS. KID:

  • FID can only be positive, and it’s biased, KID is unbiased.
  • The computation time of FID is O(n)O(n), but the computation time of KID is O(n2)O(n^2).

Evaluation - latent representations

What is a good latent representation? For downstream tasks, we can evaluate the quality of latent representations by evaluating the performance of the downstream tasks, such as reconstruction, classification, etc.

For unsupervised learning, there is no one-size-fits-all metric for evaluating the quality of latent representations. We can use the following metrics to evaluate the quality of latent representations:

clustering

Representations that can be grouped into clusters are potentially useful. For example, the representations of a generated model for MNIST can be grouped into different clusters, where each cluster corresponds to one or more digits.

For labelled datasets, there are many evaluation metrics. The lables are only used for evaluation, not for clustering.

from sklearn.metrics.cluster import completeness_score, homogeneity_score, v_measure_score

compression or reconstruction

Latent representations can be evaluated based on the maximum compression they can achieve without significant loss in reconstruction quality.

Some metrics: Mean Squared Error (MSE), Peak signal-to-noise ratio (PSNR), Structural similarity (SSIM), etc.

disentanglement

We want representations that disentangle independent and interpretable factors of variation in the observed data.

Some quantitative metrics:

  • Beta-VAE metric: accuracy of a linear classifier that predicts a fixed factor of variation.
  • Factor-VAE, Mutual Information Gap (MIG), SAP score, DCI disentanglement, Modularity, etc.

Reference

  • Copyright: Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.
  • Copyrights © 2021-2026 Xue Yu
  • Visitors: | Views: