The evaluation function Q(s, a) is defined as the reward received immediately upon executing action a in state s, plus the value (discounted by γ) of following the optimal policy thereafter.

Equation (3) can be rewritten in terms of Q(s, a) as

$$\pi^*(s) = \arg\max_{a} Q(s, a) \qquad (5)$$

As Equation (5) makes clear, the agent need only consider each available action a in its current state s and choose the action that maximizes Q(s, a).
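
The selection rule can be illustrated with a minimal Python sketch. It is not taken from the course material: the state name, action names, rewards, successor values, and the discount factor 0.9 are all hypothetical, and the helper names `q_value` and `greedy_action` are invented for this example. It assumes Q values are stored in a dictionary keyed by (state, action).

```python
GAMMA = 0.9  # assumed discount factor, chosen only for illustration


def q_value(immediate_reward, next_state_value, gamma=GAMMA):
    """Q(s, a): immediate reward plus the discounted value of
    following the optimal policy from the resulting state."""
    return immediate_reward + gamma * next_state_value


def greedy_action(q_table, state, available_actions):
    """Return the available action with the largest Q(state, action)."""
    return max(available_actions, key=lambda a: q_table[(state, a)])


# Hypothetical data: (immediate reward, optimal value of the next state)
# for each action available in state "s1".
outcomes = {
    ("s1", "left"): (0.0, 81.0),
    ("s1", "right"): (0.0, 100.0),
    ("s1", "up"): (10.0, 50.0),
}

# Build the Q table from the definition, then pick the maximizing action.
q_table = {key: q_value(r, v) for key, (r, v) in outcomes.items()}
best = greedy_action(q_table, "s1", ["left", "right", "up"])
print(best)
```

Once the Q values are available, the choice needs only a lookup per action in the current state; the agent does not have to look ahead to successor states at decision time.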

Last modified: Friday, 20 June 2025, 10:30 AM