Now that the required software is installed on your server, you will set up an agent that plays a simplified version of the classic Atari game, Space Invaders. For any experiment, it is necessary to obtain a baseline to help you understand how well your model performs. Because this agent takes random actions at each frame, we'll refer to it as our random, baseline agent. In this case, you will compare against this baseline agent to understand how well your agents perform in later steps.

With Gym, you maintain your own game loop. This means that you handle every step of the game's execution: at every time step, you give Gym a new action and ask Gym for the game state. In this tutorial, the game state is the game's appearance at a given time step, and is precisely what you would see if you were playing the game.

Using your preferred text editor, create a Python file named bot_2_random.py. Here, we'll use nano:

Bash
nano bot_2_random.py

Note: Throughout this guide, the bots' names are aligned with the Step number in which they appear, rather than the order in which they appear. Hence, this bot is named bot_2_random.py rather than bot_1_random.py.

Start this script by adding the following highlighted lines. These lines include a comment block that explains what this script will do and two import statements that import the packages this script will ultimately need in order to run:

/AtariBot/bot_2_random.py

Python
"""
Bot 2 -- Make a random, baseline agent for the SpaceInvaders game.
"""
import gym
import random

Add a main function. In this function, create the game environment, SpaceInvaders-v0, and then initialize the game using env.reset:

/AtariBot/bot_2_random.py

Python
. . .
import gym
import random

def main():
    env = gym.make('SpaceInvaders-v0')
    env.reset()

Next, add a call to the env.step function. This function returns the following kinds of values:

  • state: The new state of the game, after applying the provided action.
  • reward: The increase in score that the state incurs. For example, when a bullet destroys an alien and the score increases by 50 points, then reward = 50. In playing any score-based game, the player's goal is to maximize the score, which is synonymous with maximizing the total reward.
  • done: Whether or not the episode has ended, which usually occurs when the player has lost all lives.
  • info: Extraneous information that you'll put aside for now.

You will use reward to count your total reward. You'll also use done to determine when the player dies, which is when done returns True.
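
To make these return values concrete, here is a brief, standalone sketch (not part of bot_2_random.py) that unpacks all four values from a single call to env.step. It assumes the same Gym version and SpaceInvaders-v0 environment used throughout this tutorial; the printed shapes and values are only examples:

Python
import gym

env = gym.make('SpaceInvaders-v0')
env.reset()

# Take one arbitrary action and unpack every value that env.step returns.
action = env.action_space.sample()            # a random valid action
state, reward, done, info = env.step(action)

print(state.shape)  # the screen image at this time step, e.g. (210, 160, 3)
print(reward)       # score gained at this step, often 0.0
print(done)         # True once the episode has ended
print(info)         # extra diagnostic data, e.g. the number of remaining lives

In bot_2_random.py itself, you will keep only reward and done and discard the other two values.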

Add the following game loop, which instructs the game to loop until the player dies:

/AtariBot/bot_2_random.py

Python
. . .
def main():
    env = gym.make('SpaceInvaders-v0')
    env.reset()
    episode_reward = 0
    while True:
        action = env.action_space.sample()
        _, reward, done, _ = env.step(action)
        episode_reward += reward
        if done:
            print('Reward: %s' % episode_reward)
            break

Finally, run the main function. Include a __name__ check to ensure that main only runs when you invoke the script directly with python bot_2_random.py. If you do not add the if check, main will be triggered every time the Python file is executed, even when you import the file. Consequently, it is good practice to place the code in a main function that runs only when __name__ == '__main__'.

/AtariBot/bot_2_random.py

Python
. . .
def main():
    . . .
        if done:
            print('Reward: %s' % episode_reward)
            break

if __name__ == '__main__':
    main()

Save the file and exit the editor. If you are using nano, do so by pressing CTRL+X, then Y, then ENTER. Then, run your script by typing:

Bash
python bot_2_random.py

Your program will output a number akin to the following. Note that each time you run the file, you will get a different result:

Output
Making new env: SpaceInvaders-v0
Reward: 210.0

These random results present an issue. In order to produce work that other researchers and practitioners can benefit from, your results and trials must be reproducible. To correct this, reopen the script file:

Bash
nano bot_2_random.py

After import random, add random.seed(0). After env = gym.make('SpaceInvaders-v0'), add env.seed(0). Together, these lines "seed" the environment with a consistent starting point, ensuring that the results will always be reproducible. Your final file will match the following, exactly:

/AtariBot/bot_2_random.py

Python
"""
Bot 2 -- Make a random, baseline agent for the SpaceInvaders game.
"""
import gym
import random

random.seed(0)

def main():
    env = gym.make('SpaceInvaders-v0')
    env.seed(0)
    env.reset()
    episode_reward = 0
    while True:
        action = env.action_space.sample()
        _, reward, done, _ = env.step(action)
        episode_reward += reward
        if done:
            print('Reward: %s' % episode_reward)
            break

if __name__ == '__main__':
    main()

Save the file and close your editor, then run the script by typing the following in your terminal:

Bash
python bot_2_random.py

This will output the following reward, exactly:

Output
Making new env: SpaceInvaders-v0
Reward: 555.0

This is your very first bot, although it is rather unintelligent since it does not account for the surrounding environment when it makes decisions. For a more reliable estimate of your bot's performance, you can have the agent run multiple episodes at a time and report the average reward across those episodes. To configure this, first reopen the file:

Bash
nano bot_2_random.py

After random.seed(0), add the following highlighted line, which tells the agent to play the game for 10 episodes:

/AtariBot/bot_2_random.py

Python
. . .
random.seed(0)
num_episodes = 10
. . .

Right after env.seed(0), start a new list of rewards:

/AtariBot/bot_2_random.py

Python
. . .
env.seed(0)
rewards = []
. . .

Nest all of the code from env.reset() to the end of main() in a for loop that iterates num_episodes times. Make sure to indent each line from env.reset() to break by four spaces:

/AtariBot/bot_2_random.py

Python
. . .
def main():
    env = gym.make('SpaceInvaders-v0')
    env.seed(0)
    rewards = []
    for _ in range(num_episodes):
        env.reset()
        episode_reward = 0
        while True:
            ...

Right before break, which is currently the last line of the main game loop, append the current episode's reward to the list of all rewards:

/AtariBot/bot_2_random.py

Python
. . .
            if done:
                print('Reward: %s' % episode_reward)
                rewards.append(episode_reward)
                break
. . .

main 函数的末尾,报告平均奖励:

/AtariBot/bot_2_random.py

Python
. . .
def main():
    ...
                print('Reward: %s' % episode_reward)
                rewards.append(episode_reward)
                break
    print('Average reward: %.2f' % (sum(rewards) / len(rewards)))
. . .

Your file will now align with the following. Note that the code block below includes a few comments to clarify key parts of the script:

/AtariBot/bot_2_random.py

Python
"""
Bot 2 -- Make a random, baseline agent for the SpaceInvaders game.
"""
import gym
import random

random.seed(0)  # make results reproducible
num_episodes = 10

def main():
    env = gym.make('SpaceInvaders-v0')  # create the game
    env.seed(0)  # make results reproducible
    rewards = []
    for _ in range(num_episodes):
        env.reset()
        episode_reward = 0
        while True:
            action = env.action_space.sample()
            _, reward, done, _ = env.step(action)  # random action
            episode_reward += reward
            if done:
                print('Reward: %d' % episode_reward)
                rewards.append(episode_reward)
                break
    print('Average reward: %.2f' % (sum(rewards) / len(rewards)))

if __name__ == '__main__':
    main()

Save the file, exit the editor, and run the script:

Bash
python bot_2_random.py

This will print the following average reward, exactly:

Output
Making new env: SpaceInvaders-v0
. . .
Average reward: 163.50

We now have a more reliable estimate of the baseline score to beat. To create a superior agent, though, you will need to understand the framework of reinforcement learning. How can the abstract notion of "decision-making" be made more concrete?


Understanding Reinforcement Learning

In any game, the player's goal is to maximize their score. In this guide, the player's score is referred to as its reward. To maximize their reward, the player must be able to refine their decision-making abilities. Formally, a decision is the process of looking at the game, or observing the game's state, and picking an action. Our decision-making function is called a policy; a policy accepts a state as input and "decides" on an action:

policy: state -> action

To build such a function, we will start with a specific set of algorithms in reinforcement learning called Q-learning algorithms. To illustrate these, consider the initial state of the game, which we'll call state0: your spaceship and the aliens are all in their starting positions. Then, assume we have access to a magical "Q-table" that tells us how much reward each action will earn:

STATE     ACTION    REWARD
state0    shoot     10
state0    right     3
state0    left      3

The shoot action will maximize your reward, as it results in the reward with the highest value: 10. As you can see, a Q-table provides a straightforward way to make decisions based on the observed state:

policy: state -> look at Q-table, pick action with greatest reward
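
For illustration only (this is not part of the tutorial's scripts), such a Q-table could be stored as a plain Python dictionary, and the policy then reduces to picking the action with the largest entry for the current state:

Python
# A hypothetical Q-table for the single state `state0` described above.
q_table = {
    ('state0', 'shoot'): 10,
    ('state0', 'right'): 3,
    ('state0', 'left'): 3,
}

def policy(state, actions=('shoot', 'right', 'left')):
    """Pick the action with the highest Q-value in the given state."""
    return max(actions, key=lambda action: q_table[(state, action)])

print(policy('state0'))  # prints 'shoot', the highest-reward action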

However, most games have too many states to list in a table. In such cases, the Q-learning agent learns a Q-function instead of a Q-table. We use this Q-function similarly to how we used the Q-table previously. Rewriting the table entries as functions gives us the following:

Q(state0, shoot) = 10
Q(state0, right) = 3
Q(state0, left) = 3

Given a particular state, it's easy for us to make a decision: we simply look at each possible action and its reward, then take the action that corresponds to the highest expected reward. Reformulating the earlier policy more formally, we have:

policy: state -> argmax_{action} Q(state, action)

This satisfies the requirements of a decision-making function: given a state in the game, it decides on an action. However, this solution depends on knowing Q(state, action) for every state and action. To estimate Q(state, action), consider the following:

  1. Given many observations of an agent's states, actions, and rewards, one can obtain an estimate of the reward for every state and action by taking a running average.
  2. Space Invaders is a game with delayed rewards: the player is rewarded when the alien is blown up, not when the player shoots. However, the player taking an action by shooting is the true impetus for the reward. Somehow, the Q-function must assign (state0, shoot) a positive reward.

These two insights are codified in the following equations:

$$Q(\text{state}, \text{action}) = (1 - \text{learning\_rate}) \times Q(\text{state}, \text{action}) + \text{learning\_rate} \times Q_{\text{target}}$$

$$Q_{\text{target}} = \text{reward} + \text{discount\_factor} \times \max_{\text{action}'} Q(\text{state}', \text{action}')$$

These equations use the following definitions:

  • state: the state at the current time step
  • action: the action taken at the current time step
  • reward: the reward received at the current time step
  • state': the new state at the next time step, after the chosen action has been applied
  • action': all possible actions in the next state, over which the max is taken
  • learning_rate: the learning rate
  • discount_factor: the discount factor, i.e. how much the reward "degrades" as it is propagated backward
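
As a rough sketch of how the two equations above translate into code, assuming a tabular setting with a dictionary-backed Q-table (the names and values here are illustrative and not part of the tutorial's scripts):

Python
from collections import defaultdict

Q = defaultdict(float)   # Q-values default to 0 for unseen (state, action) pairs
learning_rate = 0.1      # how strongly a new estimate overrides the old one
discount_factor = 0.95   # how much future reward "degrades" as it propagates back

def q_update(state, action, reward, next_state, possible_actions):
    """Apply one Q-learning update for a single observed transition."""
    # Q_target = reward + discount_factor * max_{action'} Q(state', action')
    q_target = reward + discount_factor * max(
        Q[(next_state, a)] for a in possible_actions)
    # Q(state, action) = (1 - learning_rate) * Q(state, action)
    #                    + learning_rate * Q_target
    Q[(state, action)] = ((1 - learning_rate) * Q[(state, action)]
                          + learning_rate * q_target)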

For a complete explanation of these two equations, see this article on Understanding Q-Learning.

With this understanding of reinforcement learning in mind, all that remains is to actually run the game and obtain these Q-value estimates for a new policy.


You have successfully created a random baseline agent and learned the basic concepts of reinforcement learning. You are now ready to move on to the next step and start building a more intelligent agent.

