Diffusion Models: Classifier-Free Diffusion Guidance

Paper: https://arxiv.org/abs/2207.12598

MOTIVATION

We are interested in whether classifier guidance can be performed without a classifier.

  • Classifier guidance complicates the diffusion model training pipeline
    • it requires training an extra classifier
    • this classifier must be trained on noisy data so it is generally not possible to plug in a pre-trained classifier.
  • Furthermore, classifier guidance mixes a score estimate with a classifier gradient during sampling
    • classifier-guided diffusion sampling can be interpreted as attempting to confuse an image classifier with a gradient-based adversarial attack.
    • This raises the question of whether classifier guidance is successful at boosting classifier-based metrics such as FID and Inception score (IS) simply because it is adversarial against such classifiers.

CONTRIBUTION

  • We present classifier-free guidance, a guidance method that avoids any classifier entirely.
  • Rather than sampling in the direction of the gradient of an image classifier, classifier-free guidance instead mixes the score estimates of a conditional diffusion model and a jointly trained unconditional diffusion model.

BACKGROUND

Continuous Time Training

$$
\begin{aligned}
q(\mathbf{z}_{\lambda}\mid\mathbf{x})&=\mathcal{N}(\alpha_{\lambda}\mathbf{x},\ \sigma_{\lambda}^{2}\mathbf{I}),\quad \text{where } \alpha_{\lambda}^{2}=1/(1+e^{-\lambda}),\ \sigma_{\lambda}^{2}=1-\alpha_{\lambda}^{2}\\
q(\mathbf{z}_{\lambda}\mid\mathbf{z}_{\lambda'})&=\mathcal{N}\big((\alpha_{\lambda}/\alpha_{\lambda'})\mathbf{z}_{\lambda'},\ \sigma_{\lambda\mid\lambda'}^{2}\mathbf{I}\big),\quad \text{where } \lambda<\lambda',\ \sigma_{\lambda\mid\lambda'}^{2}=(1-e^{\lambda-\lambda'})\sigma_{\lambda}^{2}
\end{aligned}
$$

  • $\mathbf{x}\sim p(\mathbf{x})$: the data distribution
  • $\mathbf{z}=\{\mathbf{z}_{\lambda}\mid\lambda\in[\lambda_{\min},\lambda_{\max}]\}$: the latent variables of the forward process
  • $p(\mathbf{z})$ (or $p(\mathbf{z}_{\lambda})$): the marginal of $\mathbf{z}$ (or $\mathbf{z}_{\lambda}$) when $\mathbf{x}\sim p(\mathbf{x})$ and $\mathbf{z}\sim q(\mathbf{z}|\mathbf{x})$
  • $\lambda=\log(\alpha_{\lambda}^{2}/\sigma_{\lambda}^{2})$: the log signal-to-noise ratio of $\mathbf{z}_{\lambda}$; the forward process runs in the direction of decreasing $\lambda$.
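
As a small illustration of these definitions, here is a minimal NumPy sketch (the function names are mine, not the paper's) that maps a log-SNR value $\lambda$ to $(\alpha_\lambda,\sigma_\lambda)$ and draws $\mathbf{z}_\lambda\sim q(\mathbf{z}_\lambda|\mathbf{x})$:

```python
import numpy as np

def alpha_sigma_from_lambda(lam):
    """Map a log-SNR value lambda to (alpha_lambda, sigma_lambda).

    From the definitions above: alpha_lambda^2 = 1 / (1 + e^(-lambda)),
    sigma_lambda^2 = 1 - alpha_lambda^2.
    """
    alpha_sq = 1.0 / (1.0 + np.exp(-lam))
    return np.sqrt(alpha_sq), np.sqrt(1.0 - alpha_sq)

def sample_z_lambda(x, lam, rng):
    """Draw z_lambda ~ q(z_lambda | x) = N(alpha_lambda * x, sigma_lambda^2 * I)."""
    alpha, sigma = alpha_sigma_from_lambda(lam)
    eps = rng.standard_normal(x.shape)
    return alpha * x + sigma * eps

# A large lambda (high SNR) barely perturbs x; a very negative lambda is close to pure noise.
rng = np.random.default_rng(0)
x = np.ones(4)
z_clean = sample_z_lambda(x, lam=10.0, rng=rng)
z_noisy = sample_z_lambda(x, lam=-10.0, rng=rng)
```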

Conditioned on x

  • forward process: $q(\mathbf{z}_{\lambda'}|\mathbf{z}_{\lambda},\mathbf{x})=\mathcal{N}\big(\tilde{\boldsymbol{\mu}}_{\lambda'|\lambda}(\mathbf{z}_{\lambda},\mathbf{x}),\ \tilde{\sigma}_{\lambda'|\lambda}^{2}\mathbf{I}\big)$
    • $\tilde{\boldsymbol{\mu}}_{\lambda'|\lambda}(\mathbf{z}_{\lambda},\mathbf{x})=e^{\lambda-\lambda'}(\alpha_{\lambda'}/\alpha_{\lambda})\mathbf{z}_{\lambda}+(1-e^{\lambda-\lambda'})\alpha_{\lambda'}\mathbf{x}$
    • $\tilde{\sigma}_{\lambda'|\lambda}^{2}=(1-e^{\lambda-\lambda'})\sigma_{\lambda'}^{2}$
  • reverse process generative model
    • start from $p_{\theta}(\mathbf{z}_{\lambda_{\min}})=\mathcal{N}(\mathbf{0},\mathbf{I})$
    • $p_\theta(\mathbf{z}_{\lambda'}|\mathbf{z}_\lambda)=\mathcal{N}\big(\tilde{\boldsymbol{\mu}}_{\lambda'|\lambda}(\mathbf{z}_\lambda,\mathbf{x}_\theta(\mathbf{z}_\lambda)),\ (\tilde{\sigma}_{\lambda'|\lambda}^2)^{1-v}(\sigma_{\lambda|\lambda'}^2)^v\big)$
      • During sampling, we apply this transition along an increasing sequence $\lambda_{\min}=\lambda_1<\cdots<\lambda_T=\lambda_{\max}$ for $T$ timesteps.
      • We parameterize $\mathbf{x}_\theta$ in terms of $\boldsymbol{\epsilon}$-prediction: $\mathbf{x}_\theta(\mathbf{z}_\lambda)=(\mathbf{z}_\lambda-\sigma_\lambda\boldsymbol{\epsilon}_\theta(\mathbf{z}_\lambda))/\alpha_\lambda$
      • If the model $\mathbf{x}_\theta$ is correct, then as $T\to\infty$ we obtain samples from an SDE whose sample paths are distributed as $p(\mathbf{z})$ (Song et al., 2021b), and we use $p_\theta(\mathbf{z})$ to denote the continuous-time model distribution.
    • The variance
      • The variance is a log-space interpolation between $\tilde{\sigma}_{\lambda'|\lambda}^{2}$ and $\sigma_{\lambda|\lambda'}^{2}$.
      • We found it effective to use a constant hyperparameter $v$ rather than a learned, $\mathbf{z}_\lambda$-dependent $v$.
      • Note that the variances simplify to $\tilde{\sigma}_{\lambda'|\lambda}^{2}$ as $\lambda'\to\lambda$, so $v$ has an effect only when sampling with non-infinitesimal timesteps, as done in practice.
    • the mean
      • The reverse process mean comes from an estimate $\mathbf{x}_\theta(\mathbf{z}_\lambda)\approx\mathbf{x}$ plugged into $q(\mathbf{z}_{\lambda'}|\mathbf{z}_\lambda,\mathbf{x})$ (the dependence of $\mathbf{x}_\theta$ on $\lambda$ is dropped for notational simplicity).
    • We train on the objective $\mathbb{E}_{\boldsymbol{\epsilon},\lambda}\big[\|\boldsymbol{\epsilon}_\theta(\mathbf{z}_\lambda)-\boldsymbol{\epsilon}\|_2^2\big]$ (a minimal sketch of this loss follows this list), where
      • $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$
      • $\mathbf{z}_{\lambda}=\alpha_{\lambda}\mathbf{x}+\sigma_{\lambda}\boldsymbol{\epsilon}$
      • $\lambda$ is drawn from a distribution $p(\lambda)$ over $[\lambda_{\min},\lambda_{\max}]$
        • When $p(\lambda)$ is uniform, this objective (score matching over multiple noise scales) is proportional to the variational lower bound on the marginal log-likelihood of the latent variable model $\int p_{\theta}(\mathbf{x}|\mathbf{z})p_{\theta}(\mathbf{z})\,d\mathbf{z}$, ignoring the terms for the unspecified decoder $p_{\theta}(\mathbf{x}|\mathbf{z})$ and for the prior at $\mathbf{z}_{\lambda_{\min}}$.
        • If $p(\lambda)$ is not uniform, the objective can be interpreted as a weighted variational lower bound whose weighting can be tuned for sample quality.
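
As a concrete reading of the $\boldsymbol{\epsilon}$-prediction parameterization and training objective above, here is a minimal NumPy sketch; `eps_model` is a stand-in stub for the denoising network, which is not specified at this level of detail:

```python
import numpy as np

def eps_model(z_lam, lam):
    """Stub for the eps-prediction network eps_theta(z_lambda); a real model is a neural network."""
    return np.zeros_like(z_lam)

def alpha_sigma(lam):
    """alpha_lambda and sigma_lambda from the log-SNR lambda (see the definitions above)."""
    alpha_sq = 1.0 / (1.0 + np.exp(-lam))
    return np.sqrt(alpha_sq), np.sqrt(1.0 - alpha_sq)

def x_from_eps(z_lam, eps_hat, lam):
    """eps-parameterization: x_theta(z_lambda) = (z_lambda - sigma_lambda * eps_hat) / alpha_lambda."""
    alpha, sigma = alpha_sigma(lam)
    return (z_lam - sigma * eps_hat) / alpha

def training_loss(x, lam, rng):
    """One Monte Carlo sample of E_{eps,lambda} || eps_theta(z_lambda) - eps ||_2^2."""
    alpha, sigma = alpha_sigma(lam)
    eps = rng.standard_normal(x.shape)
    z_lam = alpha * x + sigma * eps          # z_lambda = alpha_lambda * x + sigma_lambda * eps
    return np.sum((eps_model(z_lam, lam) - eps) ** 2)

rng = np.random.default_rng(0)
loss = training_loss(np.ones(4), lam=1.5, rng=rng)
```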

Choice of $p(\lambda)$

  • We sample $\lambda$ via $\lambda=-2\log\tan(au+b)$ for uniformly distributed $u\in[0,1]$, where $b=\arctan(e^{-\lambda_{\max}/2})$ and $a=\arctan(e^{-\lambda_{\min}/2})-b$ (a sketch of this schedule follows this list).
  • This is a hyperbolic secant distribution modified to be supported on the bounded interval $[\lambda_{\min},\lambda_{\max}]$.
  • For finite-timestep generation, we use $\lambda$ values corresponding to uniformly spaced $u\in[0,1]$, and the final generated sample is $\mathbf{x}_\theta(\mathbf{z}_{\lambda_{\max}})$.
  • The loss for $\boldsymbol{\epsilon}_\theta(\mathbf{z}_\lambda)$ is denoising score matching over all $\lambda$:
    • The score $\boldsymbol{\epsilon}_\theta(\mathbf{z}_\lambda)$ learned by the model estimates the gradient of the log-density of the distribution of the noisy data $\mathbf{z}_\lambda$:
    • $\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda})\approx-\sigma_{\lambda}\nabla_{\mathbf{z}_{\lambda}}\log p(\mathbf{z}_{\lambda})$
    • Because we use unconstrained neural networks to define $\boldsymbol{\epsilon}_{\theta}$, there need not exist any scalar potential whose gradient is $\boldsymbol{\epsilon}_{\theta}$.
    • Sampling from the learned diffusion model resembles using Langevin diffusion to sample from a sequence of distributions $p(\mathbf{z}_\lambda)$ that converges to the distribution $p(\mathbf{x})$ of the original data $\mathbf{x}$.
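
A minimal sketch of this $\lambda$ schedule (the endpoint values $\lambda_{\min}=-20$ and $\lambda_{\max}=20$ are illustrative assumptions, not values stated in this summary):

```python
import numpy as np

def lambda_from_u(u, lam_min=-20.0, lam_max=20.0):
    """lambda = -2 log tan(a*u + b), with b = arctan(e^{-lam_max/2}), a = arctan(e^{-lam_min/2}) - b.

    u = 0 maps to lam_max and u = 1 maps to lam_min, so uniform u gives a
    truncated hyperbolic secant distribution over [lam_min, lam_max].
    """
    b = np.arctan(np.exp(-lam_max / 2.0))
    a = np.arctan(np.exp(-lam_min / 2.0)) - b
    return -2.0 * np.log(np.tan(a * u + b))

rng = np.random.default_rng(0)
lam_train = lambda_from_u(rng.uniform(size=128))          # random draws for training
lam_grid = lambda_from_u(np.linspace(0.0, 1.0, num=256))  # uniformly spaced u for T = 256 sampling steps
```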

METHODS

CLASSIFIER GUIDANCE

diffusion score: $\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c})\approx-\sigma_{\lambda}\nabla_{\mathbf{z}_{\lambda}}\log p(\mathbf{z}_{\lambda}|\mathbf{c})$

In classifier guidance, this diffusion score is modified to include the gradient of the log-likelihood of an auxiliary classifier model $p_{\theta}(\mathbf{c}|\mathbf{z}_{\lambda})$ as follows:

$$\tilde{\boldsymbol{\epsilon}}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c})=\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c})-w\sigma_{\lambda}\nabla_{\mathbf{z}_{\lambda}}\log p_{\theta}(\mathbf{c}|\mathbf{z}_{\lambda})\approx-\sigma_{\lambda}\nabla_{\mathbf{z}_{\lambda}}\big[\log p(\mathbf{z}_{\lambda}|\mathbf{c})+w\log p_{\theta}(\mathbf{c}|\mathbf{z}_{\lambda})\big]$$

  • $w$: a parameter that controls the strength of the classifier guidance

This modified score $\tilde{\boldsymbol{\epsilon}}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c})$ is then used in place of $\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c})$ when sampling from the diffusion model, resulting in approximate samples from the distribution

$$\tilde{p}_{\theta}(\mathbf{z}_{\lambda}|\mathbf{c})\propto p_{\theta}(\mathbf{z}_{\lambda}|\mathbf{c})\,p_{\theta}(\mathbf{c}|\mathbf{z}_{\lambda})^{w}.$$
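
A hedged sketch of this modification; `eps_cond` and `classifier_grad_logp` are stubs standing in for the conditional score network and the gradient of a classifier trained on noisy data, neither of which is specified here:

```python
import numpy as np

def eps_cond(z_lam, lam, c):
    """Stub: conditional score estimate eps_theta(z_lambda, c)."""
    return np.zeros_like(z_lam)

def classifier_grad_logp(z_lam, lam, c):
    """Stub: grad_{z_lambda} log p_theta(c | z_lambda) from a classifier trained on noisy data."""
    return np.zeros_like(z_lam)

def classifier_guided_eps(z_lam, lam, c, w, sigma_lam):
    """Classifier guidance: eps_tilde = eps_theta(z, c) - w * sigma_lambda * grad log p_theta(c | z)."""
    return eps_cond(z_lam, lam, c) - w * sigma_lam * classifier_grad_logp(z_lam, lam, c)
```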

As guidance strength is increased, each conditional places probability mass farther away from other classes and towards directions of high confidence given by logistic regression, and most of the mass becomes concentrated in smaller regions. This behavior can be seen as a simplistic manifestation of the Inception score boost and sample diversity decrease that occur when classifier guidance strength is increased in an ImageNet model.

Applying classifier guidance with weight $w+1$ to an unconditional model would theoretically lead to the same result as applying classifier guidance with weight $w$ to a conditional model:

  • because $\tilde{p}_{\theta}(\mathbf{z}_{\lambda}|\mathbf{c})\propto p_{\theta}(\mathbf{z}_{\lambda}|\mathbf{c})\,p_{\theta}(\mathbf{c}|\mathbf{z}_{\lambda})^{w}\propto p_{\theta}(\mathbf{z}_{\lambda})\,p_{\theta}(\mathbf{c}|\mathbf{z}_{\lambda})^{w+1}$ (using Bayes' rule, $p_{\theta}(\mathbf{z}_{\lambda}|\mathbf{c})\propto p_{\theta}(\mathbf{z}_{\lambda})\,p_{\theta}(\mathbf{c}|\mathbf{z}_{\lambda})$);
  • or, in terms of scores:
    $$\begin{aligned}\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda})-(w+1)\sigma_{\lambda}\nabla_{\mathbf{z}_{\lambda}}\log p_{\theta}(\mathbf{c}|\mathbf{z}_{\lambda})&\approx-\sigma_{\lambda}\nabla_{\mathbf{z}_{\lambda}}\big[\log p(\mathbf{z}_{\lambda})+(w+1)\log p_{\theta}(\mathbf{c}|\mathbf{z}_{\lambda})\big]\\&=-\sigma_{\lambda}\nabla_{\mathbf{z}_{\lambda}}\big[\log p(\mathbf{z}_{\lambda}|\mathbf{c})+w\log p_{\theta}(\mathbf{c}|\mathbf{z}_{\lambda})\big]\end{aligned}$$

CLASSIFIER-FREE GUIDANCE

  • Jointly train an unconditional and a conditional model: instead of training a separate classifier, we train an unconditional denoising diffusion model $p_\theta(\mathbf{z})$, parameterized through a score estimator $\boldsymbol{\epsilon}_\theta(\mathbf{z}_\lambda)$, together with the conditional model $p_\theta(\mathbf{z}|\mathbf{c})$, parameterized through $\boldsymbol{\epsilon}_\theta(\mathbf{z}_\lambda,\mathbf{c})$.
  • First, we use a single neural network to parameterize both models:
    • unconditional model: $\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda})=\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c}=\varnothing)$
    • conditional model: $\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c})$
    • We jointly train the unconditional and conditional models simply by randomly setting $\mathbf{c}$ to the unconditional class identifier $\varnothing$ with some probability $p_{\text{uncond}}$, set as a hyperparameter (see the sketch after this list).
  • Then we perform sampling using the following linear combination of the conditional and unconditional score estimates:
    $$\tilde{\boldsymbol{\epsilon}}_\theta(\mathbf{z}_\lambda,\mathbf{c})=(1+w)\boldsymbol{\epsilon}_\theta(\mathbf{z}_\lambda,\mathbf{c})-w\boldsymbol{\epsilon}_\theta(\mathbf{z}_\lambda)$$
    • This expression contains no classifier gradient, so taking a step in the $\tilde{\boldsymbol{\epsilon}}_\theta$ direction cannot be interpreted as a gradient-based adversarial attack on an image classifier.
    • Furthermore, $\tilde{\boldsymbol{\epsilon}}_\theta$ is constructed from score estimates that are non-conservative vector fields, due to the use of unconstrained neural networks,
    • so in general there cannot exist a scalar potential, such as a classifier log-likelihood, for which $\tilde{\boldsymbol{\epsilon}}_\theta$ is the classifier-guided score.
  • Classifier-free guidance vs. classifier guidance
    • Classifier-free guidance is inspired by the gradient of an implicit classifier $p^{i}(\mathbf{c}|\mathbf{z}_{\lambda})\propto p(\mathbf{z}_{\lambda}|\mathbf{c})/p(\mathbf{z}_{\lambda})$.
      • If we had access to the exact scores $\boldsymbol{\epsilon}^*(\mathbf{z}_\lambda,\mathbf{c})$ (of $p(\mathbf{z}_\lambda|\mathbf{c})$) and $\boldsymbol{\epsilon}^*(\mathbf{z}_\lambda)$ (of $p(\mathbf{z}_\lambda)$),
      • then the gradient of this implicit classifier would be $\nabla_{\mathbf{z}_\lambda}\log p^i(\mathbf{c}|\mathbf{z}_\lambda)=-\frac{1}{\sigma_\lambda}\big[\boldsymbol{\epsilon}^*(\mathbf{z}_\lambda,\mathbf{c})-\boldsymbol{\epsilon}^*(\mathbf{z}_\lambda)\big]$,
      • and classifier guidance with this implicit classifier would modify the score estimate into $\tilde{\boldsymbol{\epsilon}}^*(\mathbf{z}_\lambda,\mathbf{c})=(1+w)\boldsymbol{\epsilon}^*(\mathbf{z}_\lambda,\mathbf{c})-w\boldsymbol{\epsilon}^*(\mathbf{z}_\lambda)$.
    • This expression resembles the classifier-free guidance equation above, but the two differ:
      • $\tilde{\boldsymbol{\epsilon}}^*(\mathbf{z}_\lambda,\mathbf{c})$ is built from the scaled gradient of the implicit classifier, $\boldsymbol{\epsilon}^*(\mathbf{z}_\lambda,\mathbf{c})-\boldsymbol{\epsilon}^*(\mathbf{z}_\lambda)$, computed from exact scores,
      • whereas classifier-free guidance uses the estimate $\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c})-\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda})$, which is not in general the (scaled) gradient of any classifier, again because the score estimates are the outputs of unconstrained neural networks.
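
To tie the training and sampling sides together, here is a minimal sketch under the same assumptions as before: a stubbed `eps_model`, and a hypothetical `NULL_TOKEN` standing in for the unconditional class identifier $\varnothing$ (how $\varnothing$ is encoded is an implementation choice, not specified here):

```python
import numpy as np

NULL_TOKEN = -1  # hypothetical encoding of the unconditional class identifier ∅

def eps_model(z_lam, lam, c):
    """Stub for eps_theta(z_lambda, c); c == NULL_TOKEN selects the unconditional branch."""
    return np.zeros_like(z_lam)

def drop_condition(c, p_uncond, rng):
    """Joint training: replace c by ∅ with probability p_uncond (e.g. 0.1 or 0.2)."""
    return NULL_TOKEN if rng.uniform() < p_uncond else c

def cfg_eps(z_lam, lam, c, w):
    """Classifier-free guidance: eps_tilde = (1 + w) * eps_theta(z, c) - w * eps_theta(z)."""
    return (1.0 + w) * eps_model(z_lam, lam, c) - w * eps_model(z_lam, lam, NULL_TOKEN)

# Training step (sketch): c_used = drop_condition(c, p_uncond=0.1, rng=rng), then the usual
# eps-prediction loss with eps_model(z_lam, lam, c_used).
# Sampling step (sketch): use cfg_eps(z_lam, lam, c, w) wherever eps_theta(z_lam, c) would be used.
```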

EXPERIMENTS


  1. Goal: The main goal of the experiments is to show that classifier-free guidance can achieve a FID (Fréchet Inception Distance) / IS (Inception Score) trade-off similar to that of classifier guidance, and to understand the behavior of classifier-free guidance.

  2. Setup: The authors train diffusion models on downsampled class-conditional ImageNet, the standard setting for studying the FID/IS trade-off since the BigGAN paper.

  3. Architecture and hyperparameters: For a fair comparison with prior work, the authors use the same model architectures and hyperparameter settings as the guided diffusion models of Dhariwal & Nichol (2021), even though these settings were tuned for classifier guidance and may be suboptimal for classifier-free guidance.

  4. Classifier-free guidance in practice: The authors show that a purely generative diffusion model, without any classifier, can synthesize samples of extremely high fidelity, comparable to other classes of generative models.

  5. Results: Classifier-free guidance achieves a similar FID/IS trade-off, and in some cases the models are competitive with, and sometimes better than, prior work on sample quality metrics.

  6. Effect of guidance strength:

    • By varying the guidance strength $w$ for 64x64 and 128x128 class-conditional ImageNet generation, the authors show its effect on sample quality: a small non-zero guidance strength yields the best FID, while a stronger guidance strength yields the best IS.

  7. Effect of the unconditional training probability:

    • The authors study how the probability $p_{\text{uncond}}$ of training unconditionally affects sample quality. Smaller values such as $p_{\text{uncond}}=0.1$ or $0.2$ outperform $p_{\text{uncond}}=0.5$ across the entire IS/FID frontier.

  8. Effect of the number of sampling steps:

    • The authors also study how the number of sampling steps $T$ affects sample quality for the 128x128 ImageNet model. Increasing $T$ improves sample quality, but $T=256$ strikes a good balance between sample quality and sampling speed for this model.
