Machine Learning: 4.3.1. The Q Function

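The evaluation function Q(s, a) is defined as the reward received immediately upon executing action a in state s, plus the value (discounted by γ) of following the optimal policy thereafter.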

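Using Q(s, a), we can restate Equation (3) as follows, where δ(s, a) denotes the state that results from executing action a in state s: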

$Q(s, a) = r(s, a) + \gamma \max_{a'} Q(\delta(s, a), a')$
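One way to see what the recursive definition says is to compute Q from it. The following minimal sketch repeatedly applies the right-hand side of the equation as an update until the table stops changing. This is Q-value iteration, and it assumes r and δ are known. The four-state line world, the reward of 100 for entering the goal, γ = 0.9, and the names `delta`, `r`, and `q_value_iteration` are all hypothetical and chosen only for illustration.

```python
# Hypothetical deterministic world (illustration only): states 0..3 on a line,
# state 3 is an absorbing goal. Two actions: move left or move right.
GAMMA = 0.9
STATES = [0, 1, 2, 3]
ACTIONS = ["left", "right"]
GOAL = 3


def delta(s, a):
    """Hypothetical transition function delta(s, a): the successor state."""
    if s == GOAL:
        return s  # absorbing goal: every action stays put
    if a == "right":
        return min(s + 1, GOAL)
    return max(s - 1, 0)


def r(s, a):
    """Hypothetical reward r(s, a): 100 for the move that enters the goal, else 0."""
    return 100.0 if s != GOAL and delta(s, a) == GOAL else 0.0


def q_value_iteration(tol=1e-9, max_sweeps=1000):
    """Apply Q(s,a) <- r(s,a) + GAMMA * max_a' Q(delta(s,a), a') to every
    (s, a) pair until the largest change in one sweep falls below tol."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(max_sweeps):
        new_q = {
            (s, a): r(s, a) + GAMMA * max(q[(delta(s, a), a2)] for a2 in ACTIONS)
            for s in STATES
            for a in ACTIONS
        }
        change = max(abs(new_q[k] - q[k]) for k in q)
        q = new_q
        if change < tol:
            break
    return q


if __name__ == "__main__":
    q = q_value_iteration()
    for s in STATES:
        print(s, {a: round(q[(s, a)], 2) for a in ACTIONS})
```

Under these assumptions the table settles at Q(2, right) = 100, Q(1, right) = 90 and Q(0, right) = 81. Each step further from the goal discounts the reward of 100 by one more factor of γ.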

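Equation (5) makes clear that the agent need only consider each action a available in its current state s and choose the action that maximizes Q(s, a), that is, $\pi^*(s) = \arg\max_a Q(s, a)$.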
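Here is a minimal sketch of that point. Once a Q table is available, choosing an action is just an argmax over the table entries for the current state, and the code never refers to r or δ. The Q values below are hypothetical. They correspond to a four-state line world with an absorbing goal at state 3, a reward of 100 for entering it, and γ = 0.9. The name `greedy_action` is illustrative.

```python
# Hypothetical Q table (illustration only): keys are (state, action), values are Q(s, a).
Q = {
    (0, "left"): 72.9, (0, "right"): 81.0,
    (1, "left"): 72.9, (1, "right"): 90.0,
    (2, "left"): 81.0, (2, "right"): 100.0,
    (3, "left"): 0.0,  (3, "right"): 0.0,
}
ACTIONS = ["left", "right"]


def greedy_action(q, s, actions):
    """Return argmax_a Q(s, a) over the actions available in state s.

    Only table lookups are needed: no reward function or transition
    function is consulted. Python's max() returns the first maximal item,
    so ties go to the action listed first.
    """
    return max(actions, key=lambda a: q[(s, a)])


if __name__ == "__main__":
    for s in range(4):
        print(s, greedy_action(Q, s, ACTIONS))
```

Here `greedy_action(Q, 0, ACTIONS)` returns `"right"`. In the goal state both entries are 0, so the tie-break picks `"left"`, which is harmless because every action leaves the agent in the goal.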


Last modified: Friday, 20 June 2025, 10:30 AM