Machine Learning: 1.5.3 Reinforcement learning

Reinforcement learning lies somewhere between supervised and unsupervised learning. The algorithm is told when its answer is wrong, but not how to correct it. It has to explore and try out different possibilities until it works out how to get the answer right. Reinforcement learning is sometimes called learning with a critic, because the critic scores the answer but does not suggest improvements.

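The "critic" idea can be sketched with a small multi-armed bandit, where the learner only ever sees a numeric score for the action it tried and never the correct action. This is a minimal illustration under stated assumptions, not a method taken from the text: the epsilon-greedy rule, the function name `run_bandit`, the Gaussian reward noise, and the reward means are all hypothetical choices made for the example.

```python
import random

def run_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy learner that only sees a score for the action it tried.

    true_means is hidden from the learner; it is used only by the simulated
    critic to generate noisy rewards (an assumption of this sketch).
    """
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n        # how often each action was tried
    estimates = [0.0] * n   # running average reward per action
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n)  # explore: try a random action
        else:
            action = max(range(n), key=lambda a: estimates[a])  # exploit
        # The critic returns a score, not the right answer.
        reward = rng.gauss(true_means[action], 0.5)
        counts[action] += 1
        # Incremental update of the running average.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

estimates, counts = run_bandit([0.2, 0.5, 1.0])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print("estimated rewards:", [round(e, 2) for e in estimates])
print("action the learner prefers:", best_action)
```

The learner is never told which action is best; it infers this from the scores alone, which is why some exploration (`epsilon`) is needed.
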
Reinforcement learning is the problem of getting an agent to act in the world so as to maximize its rewards. Unlike in most forms of machine learning, the learner (the program) is not told which actions to take; it must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, an action may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards.

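To make the point about delayed rewards concrete, the sketch below uses tabular Q-learning, one standard algorithm that the text itself does not name, on a made-up five-state corridor. The environment, the names `step` and `q_learning`, and the values of `alpha`, `gamma`, and `epsilon` are assumptions chosen for illustration. The only reward is at the far right end, so early moves earn nothing immediately and matter only through the situations they lead to.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the goal (hypothetical setup)
LEFT, RIGHT = 0, 1

def step(state, action):
    """Move one cell left or right; reward 1.0 only on reaching the goal."""
    if action == RIGHT:
        nxt = min(state + 1, N_STATES - 1)
    else:
        nxt = max(state - 1, 0)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore at random, and also break ties at random.
            if rng.random() < epsilon or q[state][LEFT] == q[state][RIGHT]:
                action = rng.choice((LEFT, RIGHT))
            else:
                action = LEFT if q[state][LEFT] > q[state][RIGHT] else RIGHT
            nxt, reward, done = step(state, action)
            # The target includes the discounted value of the next state, so
            # the final reward propagates back to earlier actions.
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q

q = q_learning()
policy = ["right" if q[s][RIGHT] > q[s][LEFT] else "left"
          for s in range(N_STATES - 1)]
print("learned policy for states 0..3:", policy)
```

The discount factor `gamma` controls how strongly a later reward counts toward the value of an earlier action; with `gamma` below 1, states closer to the goal end up with higher values.
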
Example

Consider teaching a dog a new trick: we cannot tell it what to do, but we can reward it when it does the right thing and punish it when it does the wrong thing. The dog has to work out for itself which of its actions earned the reward or the punishment. We can use a similar method to train computers to do many tasks, such as playing backgammon or chess, scheduling jobs, and controlling robot limbs. This is what distinguishes reinforcement learning from supervised learning, which learns from examples provided by a knowledgeable expert.

Last modified: Wednesday, 18 June 2025, 22:14