机器学习: 4.3.4. 收敛性（Convergence）

Q 学习算法会收敛到等于真实 Q 函数的 Q 值吗？

是的，在特定条件下，Q 学习算法会收敛。这些条件包括：

假设系统是一个确定性马尔可夫决策过程（MDP）。
假设即时奖励值是有界的；也就是说，存在一个正数常数 c，使得对于所有状态 s 和动作 a，都有 $∣ r (s, a) ∣ < c$ 。
假设智能体选择动作的方式，使其能够无限次地访问每一个可能的状态-动作对。

Will the Q Learning Algorithm converge toward a Q equal to the true Q function?
Yes, under certain conditions.
1. Assume the system is a deterministic MDP.
2. Assume the immediate reward values are bounded; that is, there exists some positive constant c such
that for all states s and actions a, | r(s, a)| < c
3. Assume the agent selects actions in such a fashion that it visits every possible state-action pair
infinitely often

最后修改: 2025年06月20日星期五 10:45