Machine Learning: 4.3.2. An Algorithm for Learning Q

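Learning the Q function corresponds to learning the optimal policy, because the optimal action in any state is the one that maximizes Q for that state.
The key problem is finding a reliable way to estimate training values for Q, given only a sequence of immediate rewards r spread out over time. This can be accomplished through iterative approximation.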

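The definition of the Q function can be rewritten recursively, so that the value of Q at one state is expressed through the value of Q at its successor state.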
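
A sketch of the rewritten definition, assuming the usual notation for deterministic Markov decision processes, where $r(s, a)$ is the immediate reward for taking action $a$ in state $s$ and $\delta(s, a)$ is the resulting state (both symbols are assumptions, since this section does not define them):

$$Q(s, a) = r(s, a) + \gamma \max_{a'} Q(\delta(s, a), a')$$

Read this way, the value of an action is its immediate reward plus the discounted value of the best action available in the successor state, which is what makes iterative approximation possible.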

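Q learning algorithm

This Q learning algorithm assumes deterministic rewards and actions. The discount factor $\gamma$ may be any constant such that $0 \leq \gamma < 1$.

We use $\hat{Q}$ to refer to the learner's estimate, or hypothesis, of the actual Q function.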
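
The following is a minimal Python sketch of tabular Q learning under the deterministic assumption, not the exact procedure from this course. The five-state chain, its reward of 100 for entering the goal state, the uniformly random action choice, and the names `step`, `q_learning`, and `q_hat` are all hypothetical choices made for illustration.

```python
import random

GAMMA = 0.9          # discount factor, must satisfy 0 <= GAMMA < 1
N_STATES = 5         # hypothetical chain of states 0..4; state 4 is an absorbing goal
ACTIONS = (-1, +1)   # move one state left or one state right


def step(state, action):
    """Deterministic toy environment: returns (next_state, reward)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 100 if next_state == N_STATES - 1 else 0
    return next_state, reward


def q_learning(episodes=500, seed=0):
    """Learn a table q_hat[(s, a)] that estimates Q for the toy chain."""
    rng = random.Random(seed)
    goal = N_STATES - 1
    # Initialise every table entry to zero.
    q_hat = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s = rng.randrange(goal)              # start in a random non-goal state
        while s != goal:
            a = rng.choice(ACTIONS)          # explore uniformly at random
            s_next, r = step(s, a)
            # Deterministic update: immediate reward plus the discounted
            # value of the best action in the successor state.
            q_hat[(s, a)] = r + GAMMA * max(q_hat[(s_next, b)] for b in ACTIONS)
            s = s_next
    return q_hat


if __name__ == "__main__":
    table = q_learning()
    for (s, a), value in sorted(table.items()):
        print(f"state={s} action={a:+d} q_hat={value:.2f}")
```

In this toy setting the estimates for moving right should approach 100, 90, 81, and 72.9 as the distance to the goal grows, each being $\gamma$ times the one before it. Entries for the goal state stay at zero because no action is ever taken from it.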

Last modified: Friday, 20 June 2025, 10:35 AM