Research works in Feb. 2025

2025-02-01

research works


DeepSeek technical roadmap

DeepSeek-R1 paper

Blogs:

Sebastian Raschka: Some thoughts on DeepSeek R1 and reasoning models (Chinese-language post: 关于DeepSeek R1和推理模型，我有几点看法)

Sebastian Raschka: Understanding Reasoning LLMs

Google’s AI White Paper “Agents”

[Translation] AI Agent white paper (Chinese-language post: [译] AI Agent 白皮书)

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (Google DeepMind)

paper link

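With the rapid development of LLMs, both model size (parameter count) and compute consumption have grown sharply. Traditionally, the main way to improve model performance has been to scale model parameters. This approach, however, not only requires substantial training resources but can also make inference prohibitively expensive. Researchers have therefore begun to explore how to make better use of compute at test time, improving performance without significantly increasing the parameter count.
This paper asks how, without increasing model parameters or training compute, performance can be improved by scaling test-time compute well. In the paper's words: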

If an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt?

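The authors propose two strategies for allocating test-time compute. The core idea is that, for a given prompt, extra compute at inference time is used to adaptively modify the model's output distribution, so that it yields better results than sampling directly from the LLM.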

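Modifying the proposal distribution. For a given reasoning task, the model's outputs are improved directly through fine-tuning methods such as RL. This introduces no additional input tokens; the model improves its proposal distribution through self-critique and iterative revision.
Optimizing the verifier. The verifier selects the best answer from the proposal distribution, as in best-of-N sampling. Additionally training a process-based verifier, or process reward model, makes better selection schemes available.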

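Each strategy admits several concrete methods. For the verifier, for example, we can choose among search algorithms such as beam search, lookahead search, and best-of-N. How effective each method is may depend on the specific problem. So, for a given question/prompt, how should one pick the most effective way to spend test-time compute? And how does the result compare with simply using a larger pretrained model?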
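To make the verifier-based strategy concrete, here is a minimal best-of-N sketch. It is not the paper's implementation: `sample` stands in for drawing from the proposal distribution, `verifier` stands in for a learned reward model, and `toy_sample` / `toy_verifier` are hypothetical toys invented purely for illustration.

```python
import random
from typing import Callable, List


def best_of_n(
    prompt: str,
    sample: Callable[[str, random.Random], str],
    verifier: Callable[[str, str], float],
    n: int,
    seed: int = 0,
) -> str:
    """Draw n candidates from the proposal distribution and keep the one
    that the verifier scores highest."""
    rng = random.Random(seed)
    candidates: List[str] = [sample(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda y: verifier(prompt, y))


# Hypothetical toy stand-ins (assumptions, not from the paper): the "model"
# guesses an integer at random, and the "verifier" prefers guesses that are
# closer to the true sum written in the prompt.
def toy_sample(prompt: str, rng: random.Random) -> str:
    return str(rng.randint(0, 20))


def toy_verifier(prompt: str, answer: str) -> float:
    a, b = (int(t) for t in prompt.split("+"))
    return -abs(int(answer) - (a + b))
```

Calling `best_of_n("3+4", toy_sample, toy_verifier, n=64)` spends 64 samples of test-time compute on one prompt. Beam search and lookahead search differ in that they apply the verifier to partial solutions rather than only to finished answers.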

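To answer the first question, the paper proposes a test-time compute-optimal scaling strategy: for a given question and test-time compute budget, choose the hyperparameters that maximize the model's performance at inference time. The objective is: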

$$\theta^*_{q, y^*(q)}(N) = \arg\max_{\theta} \left( \mathbb{E}_{y \sim \text{Target}(\theta, N, q)} \left[ 1_{y = y^*(q)} \right] \right),$$

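Here $q$ is the given prompt, $N$ is the compute budget, and $\theta$ denotes the hyperparameters of the test-time strategy, such as the number of samples drawn in best-of-N sampling. $\text{Target}(\theta, N, q)$ is the output distribution the language model induces with hyperparameters $\theta$ under compute budget $N$ and prompt $q$. $1_{y = y^*(q)}$ is the indicator function, equal to 1 when the model output $y$ matches the ground-truth answer $y^*(q)$ and 0 otherwise.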
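The objective can be read as "estimate the accuracy of each setting $\theta$ under budget $N$, then take the argmax". The sketch below does this by Monte Carlo over a finite set of candidate settings. Everything in it is an illustrative assumption: the `Strategy` type, the two named settings, and in particular the made-up success-probability curves are not taken from the paper.

```python
import random
from typing import Callable, Dict

# A strategy maps (prompt, budget, rng) to one final answer; it stands in
# for sampling y ~ Target(theta, N, q).
Strategy = Callable[[str, int, random.Random], str]


def estimate_accuracy(strategy: Strategy, q: str, y_star: str,
                      budget: int, trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[1{y == y*(q)}] for one setting theta."""
    rng = random.Random(seed)
    hits = sum(strategy(q, budget, rng) == y_star for _ in range(trials))
    return hits / trials


def compute_optimal(strategies: Dict[str, Strategy], q: str,
                    y_star: str, budget: int) -> str:
    """Return the name of the setting with the highest estimated accuracy."""
    return max(
        strategies,
        key=lambda name: estimate_accuracy(strategies[name], q, y_star, budget),
    )


def make_toy_strategy(success_prob: Callable[[int], float]) -> Strategy:
    """Hypothetical strategy that is correct with probability success_prob(N)."""
    def strategy(q: str, budget: int, rng: random.Random) -> str:
        return "correct" if rng.random() < success_prob(budget) else "wrong"
    return strategy


# Made-up accuracy curves, chosen only so that the best setting changes
# with the budget; they are not results from the paper.
toy_strategies = {
    "best_of_n": make_toy_strategy(lambda n: 1 - 0.9 ** n),
    "revisions": make_toy_strategy(lambda n: min(0.8, 0.3 + 0.05 * n)),
}
```

With these toy curves, `compute_optimal(toy_strategies, "q", "correct", budget)` selects `revisions` at a budget of 4 and `best_of_n` at a budget of 64. In practice $y^*(q)$ is unknown at test time, which is why the objective cannot be optimized directly per question.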

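Solving for the optimal strategy $\theta^*_{q, y^*(q)}(N)$ directly is intractable, so the paper introduces question difficulty as a sufficient statistic and designs an approximately optimal strategy around it. In the experiments, comparing methods across difficulty levels further shows that the strategy for spending test-time compute can be chosen according to question difficulty.
The paper estimates difficulty on a validation set; how to estimate the difficulty of a new question, and thus choose a compute strategy for it, remains an open problem.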
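A rough sketch of how difficulty can act as the selection key: bucket validation questions by how often the base model solves them, then look up the best-performing strategy per bucket. The pass-rate definition of difficulty and the default of five quantile bins are assumptions based on my reading of the paper, and both function names are hypothetical.

```python
from typing import Dict, List


def difficulty_bins(pass_rates: Dict[str, float], num_bins: int = 5) -> Dict[str, int]:
    """Assign each question a level from 0 (easiest) to num_bins - 1
    (hardest) according to the quantile of its pass rate."""
    # Highest pass rate first, so rank 0 is the easiest question.
    ranked: List[str] = sorted(pass_rates, key=lambda q: -pass_rates[q])
    return {q: rank * num_bins // len(ranked) for rank, q in enumerate(ranked)}


def best_strategy_per_level(
    val_accuracy: Dict[int, Dict[str, float]]
) -> Dict[int, str]:
    """For each difficulty level, keep the strategy with the highest
    validation accuracy at the compute budget of interest."""
    return {level: max(acc, key=acc.get) for level, acc in val_accuracy.items()}
```

At test time, a question would first be mapped to a level and then routed to that level's strategy. Estimating the level for an unseen question without ground truth is exactly the open issue noted above.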

Some experimental findings:

The efficacy of any given verifier search method depends critically on both the compute budget and the question at hand. Specifically, beam-search is more effective on harder questions and at lower compute budgets, whereas best-of-N is more effective on easier questions and at higher budgets. Moreover, by selecting the best search setting for a given question difficulty and test-time compute budget, we can nearly outperform best-of-N using up to 4x less test-time compute.

There exists a tradeoff between sequential (e.g. revisions) and parallel (e.g. standard best-of-N) test-time computation, and the ideal ratio of sequential to parallel test-time compute depends critically on both the compute budget and the specific question at hand. Specifically, easier questions benefit from purely sequential test-time compute, whereas harder questions often perform best with some ideal ratio of sequential to parallel compute. Moreover, by optimally selecting the best setting for a given question difficulty and test-time compute budget, we can outperform the parallel best-of-N baseline using up to 4x less test-time compute.
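To see what a "ratio of sequential to parallel compute" means under a fixed budget, the small sketch below enumerates the ways a budget of N generations can be split into independent chains and revision steps per chain. The function name and the assumption that the budget factors exactly into chains times steps are mine, for illustration only.

```python
from typing import List, Tuple


def budget_splits(budget: int) -> List[Tuple[int, int]]:
    """All (parallel_chains, sequential_steps) pairs whose product equals
    the budget, ordered from purely sequential to purely parallel."""
    return [(p, budget // p) for p in range(1, budget + 1) if budget % p == 0]
```

For a budget of 16, the first entry `(1, 16)` is one chain revised 16 times (purely sequential), and the last entry `(16, 1)` is 16 independent samples (standard best-of-N). The finding quoted above says the best entry in this list depends on both the budget and the question difficulty.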

The paper also studies how to trade off compute between pretraining and inference, and concludes:

Test-time and pretraining compute are not 1-to-1 “exchangeable”. On easy and medium questions, which are within a model’s capabilities, or in settings with small inference requirement, test-time compute can easily cover up for additional pretraining. However, on challenging questions which are outside a given base model’s capabilities or under higher inference requirement, pretraining is likely more effective for improving performance.

Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

paper link

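Recent work has found that LLMs produce higher-quality outputs when given more compute at inference time, which has made inference-time compute allocation one route to better model performance. Diffusion models generate clean data by iteratively denoising pure noise, so output quality can be adjusted through the number of denoising steps, which gives diffusion models some built-in flexibility in how inference-time compute is spent. The compute budget of a generative model is usually measured in NFE (number of function evaluations). Empirically, if inference time is increased only by adding denoising steps, performance plateaus after a certain NFE, and further inference time brings no additional gain. Prior work on diffusion models has therefore focused mainly on achieving high-quality outputs while keeping inference efficient, i.e., keeping NFE as small as possible. In contrast, this paper focuses on how to improve performance at inference time by searching for effective ways to spend the compute.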

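The randomness in diffusion-model sampling comes from several sources, such as the choice of initial noise and the extra noise injected during denoising. Different noises lead to samples of different quality, so this paper explores how to improve sampling quality and efficiency by selecting better noise. This boils down to two questions: how do we know which sampling noises are good, and how do we search for them? Accordingly, the paper designs verifiers, which evaluate how good a candidate noise is, and algorithms, which use the verifiers' scores to find the best initial sampling noise.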
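The simplest pairing of a verifier with a search algorithm is random search over the initial noise: draw several noises, generate from each, and keep the one the verifier likes best. The sketch below illustrates that idea with toys. `generate` stands in for a full deterministic denoising run, and `toy_generate` / `toy_image_verifier` are hypothetical placeholders, not the paper's models or verifiers.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]


def random_search(
    generate: Callable[[Vector], Vector],
    verifier: Callable[[Vector], float],
    dim: int,
    num_candidates: int,
    seed: int = 0,
) -> Tuple[Vector, float]:
    """Sample num_candidates Gaussian initial noises, run the generator on
    each, and return the noise whose sample gets the highest verifier score."""
    rng = random.Random(seed)
    best_noise: Vector = []
    best_score = float("-inf")
    for _ in range(num_candidates):
        noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        score = verifier(generate(noise))
        if score > best_score:
            best_noise, best_score = noise, score
    return best_noise, best_score


# Hypothetical toy stand-ins (assumptions): the "denoiser" shrinks the
# noise, and the "verifier" prefers samples close to the origin.
def toy_generate(noise: Vector) -> Vector:
    return [0.5 * z for z in noise]


def toy_image_verifier(sample: Vector) -> float:
    return -sum(x * x for x in sample)
```

Each candidate costs one full generation plus one verifier call, so the extra inference-time compute grows linearly with `num_candidates` rather than with the number of denoising steps.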

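A concise explanation of DDIM with a PyTorch implementation: a general method for accelerating diffusion model sampling

Recommended blog posts:

A concise explanation of DDIM with a PyTorch implementation: a general method for accelerating diffusion model sampling (Chinese-language post: DDIM 简明讲解与 PyTorch 实现：加速扩散模型采样的通用方法)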

Flow Matching

Recommended blog posts:

Flow Matching theory explained in detail (Chinese-language post: Flow Matching 理论详解)

An in-depth analysis of Flow Matching (Chinese-language post: 深入解析Flow Matching技术)

Flow-based Deep Generative Models
