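Generative Models (0)

Generative Models (0) Overview of Deep Generative Modeling
Generative Models (1.1) Variational Inference
Generative Models (1.2) Variational Auto-Encoder
Generative Models (1.3) Denoising Diffusion Probabilistic Model
Generative Models (2.1) Energy-based Model
Generative Models (3.1) Flow-based Method
Generative Models (3.2) Flow Model
Generative Models (3.3) Flow Matching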

Introduction

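Starting today, I would like to work through the mathematical foundations and practical applications of modern generative models together with you, from scratch and in an accessible way. Generative modeling is a research direction I find fascinating. Unfortunately, my undergraduate math background was weak, and when I first approached the topic as a graduate student I could not follow it at all. With paper deadlines looming, I gave up. Now that I am about to graduate and no longer under pressure to publish, I finally have time to study it slowly and properly.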

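This series is based on the recently released textbook by Chieh-Hsin Lai and Yang Song, The Principles of Diffusion Models: From Origins to Advances. The book covers both the foundations and advanced techniques of generative modeling. My plan is to master the foundations first, and then move on to the advanced techniques if time and energy allow.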

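In this post, we take a bird's-eye view of generative modeling: we introduce its definition and goals, and briefly review some classical methods to give the reader a high-level picture.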

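In the following posts, we will look at generative models in detail from three perspectives:

Variational perspective: mainly VAE and DDPM
Score-based perspective: mainly energy-based models and NCSN
Flow perspective: mainly normalizing flows and flow matching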

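After that, we will introduce one of Yang Song's major contributions: Score SDE. It extends generative modeling to continuous time through stochastic differential equations, and can be seen as a unification and extension of the three perspectives above.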

Finally, with the help of the Fokker-Planck equation, we will look at generative modeling from a single unified viewpoint.

1. What Is Generative Modeling

Generative modeling is an approach to learning the probability distribution of high-dimensional real-world data such as images, text, and audio.

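In deep generative modeling (DGM), we use a neural network to parameterize a model distribution $p_\phi$ that approximates the data distribution $p_{data}$:

During training, we minimize a discrepancy between $p_\phi$ and $p_{data}$ in order to learn the model parameters $\phi$;
During inference, we sample from $p_\phi$ to generate a new sample $x\sim p_\phi$.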

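DGM has two main goals:

Realism: generated samples should be hard to distinguish from real ones
Controllability: we can exert fine-grained control over the generation process, and the model is reasonably interpretable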

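Suppose we have a finite dataset in which every sample $x$ is drawn independently and identically distributed (i.i.d.) from some complex data distribution $p_{data}$. The goal of generative modeling is to learn, from this finite dataset, a tractable probability distribution that is close enough to the true distribution $p_{data}$ to generate new, realistic samples.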

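In deep generative modeling, a deep neural network parameterizes the model distribution $p_\phi$, where $\phi$ denotes the learnable parameters of the network. Training aims to find the optimal parameters $\phi^*$ for which the model distribution $p_\phi$ is as close as possible to the data distribution $p_{data}$, that is: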

\[\begin{equation} \phi^*=\arg\min_\phi \mathcal{D}(p_{data},p_\phi) \end{equation}\]

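Here $\mathcal{D}$ is a function that measures the discrepancy between two distributions. It does not need to be a distance in the strict metric sense.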

1.1. KL Divergence

One of the most commonly used discrepancy measures is the (forward) KL divergence:

\[\begin{equation} \begin{aligned} \mathcal{D}_{KL}(p_{data}\|p_\phi) &:=\int p_{data}(x)\log\frac{p_{data}(x)}{p_\phi(x)}\mathrm{d}x\\ &=\mathbb{E}_{x\sim p_{data}}[\log p_{data}(x)-\log p_\phi(x)] \end{aligned} \end{equation}\]

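An important property of the forward KL divergence is that minimizing it encourages the model distribution $p_\phi$ to cover the entire support of the true distribution $p_{data}$. This is known as the mass-covering property. Equation $(2)$ shows why: if there is a region where $p_{data}(x)>0$ but $p_\phi(x)=0$, the KL divergence becomes infinite.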

Since the true distribution $p_{data}$ is unknown and we can only draw samples from it, we rewrite the KL divergence as follows:

\[\begin{equation} \begin{aligned} \mathcal{D}_{KL}(p_{data}\|p_\phi) &=\mathbb{E}_{x\sim p_{data}}[\log p_{data}(x)-\log p_\phi(x)]\\ &=-\mathbb{E}_{x\sim p_{data}}[\log p_\phi(x)] - \mathcal{H}(p_{data}) \\ \end{aligned} \end{equation}\]

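where $\mathcal{H}(p_{data})=-\mathbb{E}_{x\sim p_{data}}[\log p_{data}(x)]$ is the Shannon entropy of the true distribution. For a fixed data distribution, this entropy is a constant that does not depend on $\phi$.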

Substituting this into the optimization objective in Equation $(1)$ gives:

\[\begin{equation} \begin{aligned} \phi^*&=\arg\min_\phi \mathcal{D}_{KL}(p_{data}\|p_\phi)\\ &=\arg\min_\phi-\mathbb{E}_{x\sim p_{data}}[\log p_\phi(x)] - \mathcal{H}(p_{data}) \\ &=\arg\max_\phi\mathbb{E}_{x\sim p_{data}}[\log p_\phi(x)] \end{aligned} \end{equation}\]

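This tells us that minimizing the forward KL divergence is equivalent to maximizing the expected log-likelihood of the model under the data distribution, that is, to maximum likelihood estimation. In practice, the expectation is approximated by an average over the training samples.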
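To make the decomposition concrete, here is a minimal NumPy sketch on a discrete toy problem. The distribution `p_data` and the two candidate models are made-up values chosen for illustration; the point is only that the KL divergence equals the cross-entropy minus a constant entropy, so both criteria rank the candidates identically.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log p(x) of a discrete distribution."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """-E_{x~p}[log q(x)], the negative expected log-likelihood of q under p."""
    return -np.sum(p * np.log(q))

def kl(p, q):
    """Forward KL divergence D_KL(p || q) for strictly positive p and q."""
    return np.sum(p * (np.log(p) - np.log(q)))

# Hypothetical data distribution over 3 outcomes and two candidate models.
p_data = np.array([0.5, 0.3, 0.2])
candidates = {
    "model_a": np.array([0.4, 0.4, 0.2]),
    "model_b": np.array([0.1, 0.1, 0.8]),
}

for name, p_phi in candidates.items():
    # The two printed numbers agree: D_KL = cross-entropy - H(p_data).
    print(name, kl(p_data, p_phi), cross_entropy(p_data, p_phi) - entropy(p_data))

# Minimizing KL and maximizing expected log-likelihood select the same model.
best_by_kl = min(candidates, key=lambda k: kl(p_data, candidates[k]))
best_by_loglik = max(candidates, key=lambda k: -cross_entropy(p_data, candidates[k]))
```

Because the entropy term is the same for every candidate, it never has to be computed when training a model.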

1.2. Fisher Divergence

In score-based modeling, the Fisher divergence is commonly used to measure the discrepancy between two distributions:

\[\begin{equation} \begin{aligned} \mathcal{D}_{F}(p\|q) &:=\mathbb{E}_{x\sim p}\left[\|\nabla_x\log p(x)-\nabla_x\log q(x)\|_2^2\right] \end{aligned} \end{equation}\]

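Here $\nabla_x\log p(x)$ can be viewed as a vector field pointing toward the high-probability regions of $p$. It is known as the score function.

Intuitively, $\mathcal{D}_{F}(p\|q)=0$ if and only if $p=q$ almost everywhere.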
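As a small illustration, the sketch below estimates the Fisher divergence between two 1D Gaussians by Monte Carlo. The Gaussian choice, the sample size, and the function names are assumptions made for this example. For two Gaussians with the same standard deviation $\sigma$, the score difference is the constant $(\mu_p-\mu_q)/\sigma^2$, so the divergence is $(\mu_p-\mu_q)^2/\sigma^4$.

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    """Score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma**2

def fisher_divergence_mc(mu_p, sigma_p, mu_q, sigma_q, n=100_000, seed=0):
    """Monte Carlo estimate of D_F(p || q) using samples drawn from p."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu_p, sigma_p, size=n)
    diff = gaussian_score(x, mu_p, sigma_p) - gaussian_score(x, mu_q, sigma_q)
    return np.mean(diff**2)

d_same = fisher_divergence_mc(0.0, 1.0, 0.0, 1.0)   # identical distributions
d_shift = fisher_divergence_mc(0.0, 1.0, 2.0, 1.0)  # means differ by 2, sigma = 1
```

Note that only the scores appear in the computation; the normalizing constants of the densities are never needed.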

1.3. f-Divergences

Generalizing the KL divergence leads to the following family of discrepancy measures:

\[\begin{equation} \begin{aligned} \mathcal{D}_{f}(p\|q) &:=\int q(x)f\left(\frac{p(x)}{q(x)}\right)\mathrm{d}x \end{aligned} \end{equation}\]

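where $f:\mathbb{R}_+\mapsto\mathbb{R}$ is a convex function with $f(1)=0$, so that the divergence vanishes when $p=q$. Particular choices of $f$ recover several familiar divergences:

With $f(u)=u\log u$, we get $\mathcal{D}_f=\mathcal{D}_{KL}$;
With $f(u)=\frac{1}{2}\left[ u\log u-(u+1)\log\frac{1+u}{2} \right]$, we get $\mathcal{D}_f=\mathcal{D}_{JS}$;
With $f(u)=\frac{1}{2}\lvert u-1 \rvert$, we get $\mathcal{D}_f=\mathcal{D}_{TV}$.

Two new divergences appear here, so let us introduce them briefly.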

\[\begin{equation} \begin{aligned} \mathcal{D}_{JS}(p\|q) &:=\frac{1}{2}\left[ \mathcal{D}_{KL}\left(p\|m\right)+\mathcal{D}_{KL}\left(q\|m\right) \right]\\ \end{aligned} \end{equation}\]

where $m=\frac{p+q}{2}$.

Compared with the KL divergence, the Jensen-Shannon (JS) divergence is symmetric and bounded. It plays a central role in GANs.

\[\begin{equation} \begin{aligned} \mathcal{D}_{TV}(p,q) &:=\frac{1}{2}\int_{\mathbb{R}^D}|p(x)-q(x)|\mathrm{d}x\\ &=\sup_{A\subset \mathbb{R}^D}|p(A)-q(A)| \end{aligned} \end{equation}\]

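The total variation (TV) distance measures the largest possible difference between the probabilities that the two distributions assign to the same event.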
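The three special cases above can be checked numerically. The following sketch uses two made-up discrete distributions `p` and `q` (an assumption for illustration) and compares a generic f-divergence routine against the direct definitions of KL, JS, and TV, including the supremum-over-events form of TV.

```python
from itertools import chain, combinations

import numpy as np

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) f(p(x) / q(x)) for strictly positive discrete p, q."""
    return np.sum(q * f(p / q))

def f_kl(u):
    return u * np.log(u)

def f_js(u):
    return 0.5 * (u * np.log(u) - (u + 1) * np.log((1 + u) / 2))

def f_tv(u):
    return 0.5 * np.abs(u - 1)

def kl(p, q):
    return np.sum(p * np.log(p / q))

# Hypothetical distributions over 3 outcomes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.2, 0.6])
m = 0.5 * (p + q)

kl_direct = kl(p, q)
js_direct = 0.5 * (kl(p, m) + kl(q, m))
tv_direct = 0.5 * np.sum(np.abs(p - q))

# TV as a supremum over events: enumerate every non-empty subset A of outcomes.
events = chain.from_iterable(combinations(range(3), r) for r in range(1, 4))
tv_sup = max(abs(p[list(A)].sum() - q[list(A)].sum()) for A in events)
```

On a finite space the supremum is attained by the event containing every outcome where `p` exceeds `q`.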

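1.4. Wasserstein Distance

The f-divergence family measures the discrepancy between two distributions through the ratio of their densities. The Wasserstein distance instead measures the minimal cost of transporting one distribution into the other.

Specifically, for an order $p\geq 1$ (not to be confused with a density $p$), the p-Wasserstein distance between two distributions $P$ and $Q$ defined on a metric space $(\chi,d)$ is: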

\[\begin{equation} W_p(P,Q)=\left( \inf_{\gamma\in\Gamma(P,Q)}\int_{\chi\times\chi}d(x,y)^p\mathrm{d}\gamma(x,y) \right)^{1/p} \end{equation}\]

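where $\Gamma(P,Q)$ is the set of couplings, that is, joint distributions on $\chi\times\chi$ whose marginals are $P$ and $Q$. Each $\gamma$ can be viewed as a transport "plan" that moves mass from points $x\sim P$ to points $y\sim Q$.

The Wasserstein distance depends on the geometry of the two distributions, so it remains informative even when their supports do not overlap at all.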
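The sketch below illustrates this with two distributions whose supports are disjoint. It relies on the fact that in one dimension, for two equal-size empirical samples, the optimal plan simply matches sorted samples. The uniform distribution, the shifts, and the histogram bins are assumptions chosen for the example. The JS divergence saturates at $\log 2$ no matter how far apart the distributions are, while the Wasserstein distance tracks the shift.

```python
import numpy as np

def wasserstein_1d(x, y, p=1):
    """p-Wasserstein distance between two equal-size 1D empirical samples.

    In one dimension the optimal plan matches sorted samples, so no
    optimization over couplings is needed.
    """
    x, y = np.sort(x), np.sort(y)
    return np.mean(np.abs(x - y) ** p) ** (1.0 / p)

def js_discrete(p, q):
    """JS divergence for discrete distributions, with the convention 0 log 0 = 0."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * (kl(p, m) + kl(q, m))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1000)   # samples from P, supported on [0, 1)
bins = np.arange(0, 8)                 # unit-width histogram bins on [0, 7]

results = {}
for shift in (2.0, 5.0):
    y = x + shift                      # Q is P translated; supports are disjoint
    p_hist = np.histogram(x, bins=bins)[0] / len(x)
    q_hist = np.histogram(y, bins=bins)[0] / len(y)
    results[shift] = (wasserstein_1d(x, y), js_discrete(p_hist, q_hist))
```

Here `results[2.0]` and `results[5.0]` share the same JS value but have different Wasserstein distances, which is why the latter can still provide a useful training signal for non-overlapping distributions.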

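2. Classical Deep Generative Models

A central challenge in generative modeling is learning models expressive enough to fit the distribution of complex, high-dimensional data. Researchers have proposed many modeling approaches, each trying to strike a good balance among expressiveness, interpretability, and training efficiency.

In this chapter, we briefly introduce some of the classical methods.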

2.1. Energy-Based Models (EBMs)

An EBM obtains the model distribution $p_\phi$ by learning an energy function $E_\phi$: the more probable a data point is, the lower its energy.

\[\begin{equation} p_\phi(x):=\frac{1}{Z(\phi)}\exp(-E_\phi(x)) \end{equation}\]

where

\[\begin{equation} Z(\phi)=\int\exp(-E_\phi(x))\mathrm{d}x \end{equation}\]

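However, the partition function $Z(\phi)$ is intractable: it is an integral over the entire data space, and estimating it is computationally very expensive.

To get around this problem, diffusion models work with the gradient of the log-density (the score) when generating new data, which removes the dependence on the normalization term $Z(\phi)$.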
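The reason the score avoids the partition function follows directly from the definition of the EBM density. Taking the logarithm and differentiating with respect to $x$, the term $\log Z(\phi)$ drops out because it does not depend on $x$:

\[\nabla_x\log p_\phi(x)=-\nabla_x E_\phi(x)-\nabla_x\log Z(\phi)=-\nabla_x E_\phi(x)\]

So the score is simply the negative gradient of the energy, which can be computed by differentiating the network without ever evaluating $Z(\phi)$.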

2.2. Auto-Regressive Models (AR)

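Autoregressive models use the chain rule of probability to factorize the joint model distribution, which is trained to match $p_{data}$, into a product of conditional probabilities:

\[\begin{equation} p_\phi(x)=\prod_{i=1}^Dp_\phi(x_i\mid x_1,\dots,x_{i-1}) \end{equation}\]

The advantage of autoregressive models is that the likelihood can be written down exactly, and they are very powerful at modeling distributions. However, because generation has to proceed sequentially, one component at a time, their sampling speed and flexibility are limited.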
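The following toy sketch shows both properties on binary sequences of length 4. The conditional distribution used here (a logistic function of the prefix sum, with made-up weights `w` and `b`) is purely hypothetical; a real model would use a neural network. The exact log-likelihood is a sum of log conditionals, and sampling has to proceed one position at a time.

```python
import itertools

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conditional_p1(prefix, w=0.8, b=-0.5):
    """Hypothetical conditional p(x_i = 1 | x_1..x_{i-1}), depending on the prefix sum."""
    return sigmoid(b + w * sum(prefix))

def log_likelihood(x):
    """Exact log p(x) as a sum of log conditionals (chain rule)."""
    total = 0.0
    for i, xi in enumerate(x):
        p1 = conditional_p1(x[:i])
        total += np.log(p1 if xi == 1 else 1.0 - p1)
    return total

def sample(D, rng):
    """Sequential sampling: x_i can only be drawn once x_1..x_{i-1} are known."""
    x = []
    for _ in range(D):
        x.append(int(rng.random() < conditional_p1(x)))
    return x

D = 4
# Summing the exact likelihood over all 2^D sequences should give 1.
total_prob = sum(np.exp(log_likelihood(list(bits)))
                 for bits in itertools.product([0, 1], repeat=D))
x_new = sample(D, np.random.default_rng(0))
```

The loop in `sample` cannot be parallelized across positions, which is the source of the slow generation mentioned above.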

2.3. Variational Auto-Encoder (VAE)

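The variational auto-encoder (VAE) extends the auto-encoder (AE) by introducing a latent variable $z$ that captures the underlying structure of the data $x$.

Rather than learning a deterministic mapping from $x$ to $z$, a VAE jointly learns an encoder $q_\theta(z\mid x)$ and a decoder $p_\phi(x\mid z)$. The encoder approximates the distribution of the latent variable $z$ given $x$, and the decoder reconstructs the original data from the latent variable.

A VAE is trained by maximizing the Evidence Lower Bound (ELBO), a tractable lower bound on the log-likelihood: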

\[\begin{equation} \mathcal{L}_{ELBO}(\theta,\phi\mid x)=\mathbb{E}_{q_\theta(z\mid x)}[\log p_\phi(x\mid z)] - \mathcal{D}_{KL}(q_\theta(z\mid x)\| p_{prior}(z)) \end{equation}\]

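The first term is the reconstruction term, which encourages the decoder to reconstruct the original input accurately from the latent variable. The second term regularizes the distribution of the latent variable toward a simple, known prior (usually a standard Gaussian). As a result, at inference time we only need to sample a latent variable $z\sim p_{prior}$ from this simple distribution and pass it through the decoder to generate a new sample.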
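To see that the ELBO really is a lower bound on the log-likelihood, here is a sketch on a 1D linear-Gaussian model where everything is available in closed form. The model ($z\sim\mathcal{N}(0,1)$, $x\mid z\sim\mathcal{N}(wz,s^2)$), the Gaussian encoder $\mathcal{N}(\mu,\sigma^2)$, and all numeric values are assumptions made for illustration, not the setup of any particular VAE. In this model the true posterior is Gaussian, so the bound is tight when the encoder matches it.

```python
import numpy as np

def log_marginal(x, w, s):
    """Exact log p(x) for z ~ N(0, 1), x | z ~ N(w z, s^2), i.e. x ~ N(0, w^2 + s^2)."""
    var = w**2 + s**2
    return -0.5 * np.log(2 * np.pi * var) - x**2 / (2 * var)

def elbo(x, mu, sigma, w, s):
    """ELBO for a Gaussian encoder q(z | x) = N(mu, sigma^2) and prior N(0, 1)."""
    # Reconstruction term E_q[log p(x | z)], available in closed form here
    # because E_q[(x - w z)^2] = (x - w mu)^2 + (w sigma)^2.
    recon = (-0.5 * np.log(2 * np.pi * s**2)
             - ((x - w * mu) ** 2 + (w * sigma) ** 2) / (2 * s**2))
    # KL(N(mu, sigma^2) || N(0, 1)) in closed form.
    kl = 0.5 * (mu**2 + sigma**2 - 1.0 - np.log(sigma**2))
    return recon - kl

x, w, s = 1.0, 1.0, 1.0

# True posterior of this linear-Gaussian model.
post_mu = w * x / (w**2 + s**2)
post_sigma = np.sqrt(s**2 / (w**2 + s**2))

elbo_exact_posterior = elbo(x, post_mu, post_sigma, w, s)  # bound is tight
elbo_poor_encoder = elbo(x, -1.0, 2.0, w, s)               # strictly below log p(x)
log_px = log_marginal(x, w, s)
```

In a real VAE the reconstruction term has no closed form and is estimated by sampling $z$ from the encoder, but the two-term structure is the same.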

2.4. Normalizing Flows (NFs)

Classical flow-based models come in two representative forms: Normalizing Flows (NFs) and Neural Ordinary Differential Equations (NODEs). Both learn a bijection $f_\phi$ that maps a simple latent distribution $p(z)$ to a complex model distribution $p_\phi(x)$.

Both are trained through the following change-of-variables formula for densities, where $z=f_\phi^{-1}(x)$:

\[\begin{equation} \log p_\phi(x)=\log p(z)+\log\left| \det\frac{\partial f^{-1}_\phi(x)}{\partial x} \right| \end{equation}\]

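The drawbacks of these methods are also clear. NFs impose strict constraints on the model architecture in order to guarantee that the learned mapping is a bijection. NODEs need to solve an ODE, which makes training considerably less efficient. In addition, these methods tend to perform poorly when the data dimension is large.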
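A minimal sanity check of the change-of-variables formula uses an affine map $x=f(z)=az+b$ with $z\sim\mathcal{N}(0,1)$, for which the resulting density is known to be $\mathcal{N}(b,a^2)$. The affine flow and the values of `a` and `b` are assumptions chosen for this sketch; practical flows stack many learned invertible layers.

```python
import numpy as np

def standard_normal_logpdf(z):
    return -0.5 * np.log(2 * np.pi) - 0.5 * z**2

def affine_flow_logpdf(x, a, b):
    """log p(x) for x = f(z) = a z + b with z ~ N(0, 1), via change of variables."""
    z = (x - b) / a                    # z = f^{-1}(x)
    log_det_inv = -np.log(np.abs(a))   # log |d f^{-1}(x) / dx| = log |1 / a|
    return standard_normal_logpdf(z) + log_det_inv

def normal_logpdf(x, mean, std):
    return -0.5 * np.log(2 * np.pi * std**2) - (x - mean) ** 2 / (2 * std**2)

a, b = 2.0, -1.0
xs = np.linspace(-5.0, 5.0, 11)
flow_lp = affine_flow_logpdf(xs, a, b)           # density from the flow formula
ref_lp = normal_logpdf(xs, mean=b, std=abs(a))   # known density of N(b, a^2)
```

For a general flow the scalar `1 / a` becomes a Jacobian matrix, and the cost of its determinant is what drives the architectural restrictions mentioned above.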

2.5. Generative Adversarial Networks (GANs)

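A GAN consists of two neural networks: a generator $G_\phi$ and a discriminator $D_\zeta$. The generator maps random noise $z\sim p_{prior}$ to realistic samples $G_\phi(z)$ in an attempt to fool the discriminator, while the discriminator tries to tell real samples $x$ apart from generated samples $G_\phi(z)$.

The two are trained through a min-max adversarial game: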

\[\begin{equation} \min_{G_\phi}\max_{D_\zeta}\mathbb{E}_{x\sim p_{data}}[\log D_\zeta(x)] + \mathbb{E}_{z\sim p_{prior}}[\log (1-D_\zeta(G_\phi(z)))] \end{equation}\]

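In practice, training usually alternates: the discriminator is updated first (one or more steps), and then the generator is updated. The intent is to keep the generator from over-optimizing against a discriminator that no longer provides a useful signal, which can lead to mode collapse.

From the viewpoint of divergences, the discriminator implicitly measures the discrepancy between the true distribution $p_{data}$ and the model distribution $p_{G_\phi}$. The following lemma makes this precise.

Lemma 1. When the generator $G_\phi$ is fixed, the optimal discriminator is: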

\[\begin{equation} D_\zeta^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{G_{\phi}}(x)} \end{equation}\]

With the optimal discriminator, the generator's optimization problem reduces to:

\[\begin{equation} \min_{G_\phi}2\mathcal{D}_{JS}\left( p_{data}\| p_{G_{\phi}} \right)-\log 4 \end{equation}\]

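The proof of the lemma is given in the appendix.

The lemma shows that the generator is in effect minimizing the JS divergence between the two distributions. Later work such as f-GAN showed that adversarial training can in fact minimize any f-divergence.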
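Lemma 1 can be checked numerically on a small discrete space. In the sketch below, `p_data` and `p_g` are made-up distributions and the discriminator is just a vector of probabilities, one per outcome; this is an illustration of the identity, not a GAN training procedure.

```python
import numpy as np

def value(D, p_data, p_g):
    """V(D, G) = E_{p_data}[log D(x)] + E_{p_G}[log(1 - D(x))] on a discrete space."""
    return np.sum(p_data * np.log(D)) + np.sum(p_g * np.log(1.0 - D))

def js(p, q):
    """JS divergence between strictly positive discrete distributions."""
    m = 0.5 * (p + q)
    return 0.5 * (np.sum(p * np.log(p / m)) + np.sum(q * np.log(q / m)))

# Hypothetical data and generator distributions over 3 outcomes.
p_data = np.array([0.5, 0.3, 0.2])
p_g = np.array([0.2, 0.2, 0.6])

# Optimal discriminator from Lemma 1 and the value it attains.
D_star = p_data / (p_data + p_g)
v_star = value(D_star, p_data, p_g)
v_lemma = 2 * js(p_data, p_g) - np.log(4)

# Randomly drawn discriminators should never exceed the value of D_star.
rng = np.random.default_rng(0)
v_random = max(value(rng.uniform(0.01, 0.99, size=3), p_data, p_g)
               for _ in range(1000))
```

If `p_g` is set equal to `p_data`, `D_star` becomes 0.5 everywhere and the value reaches its minimum of $-\log 4$.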

Appendix

Proof of Lemma 1.

The GAN training objective can be rewritten in the following form:

\[\begin{equation} \min_{G}\max_{D}\mathbb{E}_{x\sim p_{data}}[\log D(x)] + \mathbb{E}_{x\sim p_{G}}[\log (1-D(x))] \end{equation}\]

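When the generator is fixed, $p_G$ is fixed. Writing the expectations as integrals, the discriminator's objective becomes:

\[\begin{equation} \max_{D}\int\left[p_{data}(x)\log D(x) + p_G(x)\log(1-D(x))\right]\mathrm{d}x \end{equation}\]

The integral can be maximized pointwise. For a single point $x$, denote the integrand by $\mathcal{L}_D(x)=p_{data}(x)\log D(x) + p_G(x)\log(1-D(x))$. Differentiating with respect to $D(x)$ gives:

\[\begin{equation} \begin{aligned} \frac{\partial\mathcal{L}_D(x)}{\partial D(x)} &=\frac{p_{data}(x)}{D(x)} - \frac{p_G(x)}{1-D(x)} \end{aligned} \end{equation}\]

Setting the derivative to zero (the integrand is concave in $D(x)$, so this stationary point is a maximum) gives: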

\[\begin{equation} D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{G}(x)} \end{equation}\]

Substituting $D^*$ into the objective in Equation $(18)$ gives:

\[\begin{equation} \begin{aligned} V(D^*,G) &=\mathbb{E}_{x\sim p_{data}}[\log D^*(x)] + \mathbb{E}_{x\sim p_{G}}[\log (1-D^*(x))]\\ &=\mathbb{E}_{x\sim p_{data}}\left[\log \frac{p_{data}(x)}{p_{data}(x) + p_{G}(x)}\right] + \mathbb{E}_{x\sim p_{G}}\left[\log \left(1-\frac{p_{data}(x)}{p_{data}(x) + p_{G}(x)}\right)\right]\\ &=\mathbb{E}_{x\sim p_{data}}\left[\log \frac{p_{data}(x)}{p_{data}(x) + p_{G}(x)}\right] + \mathbb{E}_{x\sim p_{G}}\left[\log \frac{p_{G}(x)}{p_{data}(x) + p_{G}(x)}\right]\\ &=\int p_{data}(x)\log \frac{p_{data}(x)}{p_{data}(x) + p_{G}(x)}\mathrm{d}x + \int p_{G}(x)\log \frac{p_{G}(x)}{p_{data}(x) + p_{G}(x)}\mathrm{d}x \end{aligned} \end{equation}\]

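By the definition of the JS divergence,

\[\mathcal{D}_{JS}\left( p_{data}\| p_{G} \right)=\frac{1}{2}\left[ \mathcal{D}_{KL}\left(p_{data}\|m\right)+\mathcal{D}_{KL}\left(p_G\|m\right) \right],\qquad m=\frac{p_{data}+p_G}{2},\]

where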

\[\begin{equation} \begin{aligned} \mathcal{D}_{KL}\left(p_{data}\|m\right) &=\int p_{data}(x)\log\frac{p_{data}(x)}{m(x)}\mathrm{d}x\\ &=\int p_{data}(x)\log\frac{2 p_{data}(x)}{p_{data}(x)+p_G(x)}\mathrm{d}x\\ &=\log2 + \int p_{data}(x)\log\frac{p_{data}(x)}{p_{data}(x)+p_G(x)}\mathrm{d}x \end{aligned} \end{equation}\]

Similarly, we have:

\[\begin{equation} \begin{aligned} \mathcal{D}_{KL}\left(p_{G}\|m\right) &=\log2 + \int p_{G}(x)\log\frac{p_{G}(x)}{p_{data}(x)+p_G(x)}\mathrm{d}x \end{aligned} \end{equation}\]

Therefore,

\[\begin{equation} \begin{aligned} \mathcal{D}_{JS}\left( p_{data}\| p_{G} \right) &=\frac{1}{2}\left[ \mathcal{D}_{KL}\left(p_{data}\|m\right)+\mathcal{D}_{KL}\left(p_G\|m\right) \right]\\ &=\log2 + \frac{1}{2}\left[\int p_{data}(x)\log \frac{p_{data}(x)}{p_{data}(x) + p_{G}(x)}\mathrm{d}x + \int p_{G}(x)\log \frac{p_{G}(x)}{p_{data}(x) + p_{G}(x)}\mathrm{d}x\right] \end{aligned} \end{equation}\]

Substituting into Equation $(22)$ gives:

\[\begin{equation} \begin{aligned} V(D^*,G) &=\int p_{data}(x)\log \frac{p_{data}(x)}{p_{data}(x) + p_{G}(x)}\mathrm{d}x + \int p_{G}(x)\log \frac{p_{G}(x)}{p_{data}(x) + p_{G}(x)}\mathrm{d}x\\ &=2\mathcal{D}_{JS}\left( p_{data}\| p_{G} \right)-\log4 \end{aligned} \end{equation}\]

Hence the generator's optimization objective is

\[\min_{G}V(D^*,G) = \min_{G}2\mathcal{D}_{JS}\left( p_{data}\| p_{G} \right)-\log4\]

This completes the proof.
