Say you tuned the previous Q-learning algorithm's model complexity and sample complexity perfectly, regardless of whether you picked a neural network or the least squares method. As it turns out, this unintelligent Q-learning agent still performs poorly on more complex games, even with an especially high number of training episodes. This section covers two techniques that can improve performance; then you will test an agent that was trained using these techniques.

The first general-purpose agent able to continually adapt its behavior without any human intervention was developed by researchers at DeepMind, who also trained their agent to play a variety of Atari games. DeepMind's original deep Q-learning (DQN) paper recognized two important issues:

  1. Correlated states: Take the state of our game at time 0, which we will call s0. Say we update Q(s0) according to the rules we derived previously. Now, take the state at time 1, which we call s1, and update Q(s1) according to the same rules. Note that the game's state at time 0 is very similar to its state at time 1. In Space Invaders, for example, the aliens may each have moved by one pixel. Said more succinctly, s0 and s1 are very similar. Likewise, we expect Q(s0) and Q(s1) to be very similar, so updating one affects the other. This leads to fluctuating Q values, as an update to Q(s0) may in fact counter the update to Q(s1). More formally, s0 and s1 are correlated, and since the Q-function is deterministic, Q(s1) is correlated with Q(s0).

  2. Q-function instability: Recall that the Q-function is both the model we train and the source of our labels. Say that our labels are randomly selected values that truly represent a distribution, L. Every time we update Q, we change L, meaning that our model is trying to learn a moving target. This is an issue, as the models we use assume a fixed distribution (see the recap equation after this list).

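To see why the labels move, it helps to write the Q-learning update from the earlier steps as a supervised-learning problem. This is only a recap, using the usual notation (learning rate $\alpha$, discount factor $\gamma$), which may differ slightly from the symbols used earlier:

$$y = r + \gamma \max_{a'} Q(s', a'), \qquad Q(s, a) \leftarrow (1 - \alpha)\,Q(s, a) + \alpha\, y$$

The label $y$ for the input $(s, a)$ is computed from $Q$ itself, so every update to $Q$ also shifts the distribution of labels that later updates try to fit.
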
To combat correlated states and an unstable Q-function:

  1. One could keep a list of states called a replay buffer. At each time step, you add the game state that you observe to this replay buffer. You also randomly sample a subset of states from this list and train on those states.

  2. The team at DeepMind duplicated Q(s, a). One copy is called Q_current(s, a), which is the Q-function you update. You need another Q-function for successor states, Q_target(s', a'), which you won't update. Recall that Q_target(s', a') is used to generate your labels. By separating Q_current from Q_target and fixing the latter, you fix the distribution your labels are sampled from. Your deep learning model can then spend a short period learning this distribution. After a period of time, you re-duplicate Q_current into a new Q_target. (A minimal sketch of both ideas follows this list.)

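To make these two fixes concrete, here is a minimal, illustrative sketch. The names ReplayBuffer and sync_target, the capacity and batch size, and the idea of storing weights as a plain list are assumptions chosen for illustration; they are not part of the pretrained model used later in this step:

import random
from collections import deque

class ReplayBuffer:
    """A fixed-size buffer of past transitions, sampled uniformly at random."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random sampling breaks the correlation between consecutive states
        return random.sample(list(self.buffer), batch_size)

def sync_target(q_current_weights):
    """Copy the current Q-function's weights into a frozen Q_target."""
    return [w.copy() for w in q_current_weights]

In a full DQN training loop you would add a transition to the buffer after every env.step(...), train Q_current on random minibatches drawn from it, and call something like sync_target(...) every few thousand steps to refresh Q_target. You will not need any of this below, because the model you load has already been trained.
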
You won't implement these yourself, but you will load pretrained models that were trained with these solutions. To do this, create a new directory where you will store these models' parameters:

mkdir models

Then use wget to download the parameters of a pretrained Space Invaders model:

wget http://models.tensorpack.com/OpenAIGym/SpaceInvaders-v0.tfmodel -P models

Next, download a Python script that specifies the model associated with the parameters you just downloaded. Note that this pretrained model places two constraints on its input that are necessary to keep in mind:

  • States must be downsampled, or reduced in size, to 84 x 84.

  • The input consists of four states, stacked.

We will address these constraints in more detail later on. For now, download the script by typing:

wget https://github.com/alvinwan/bots-for-atari-games/raw/master/src/bot_6_a3c.py

You will now run this pretrained Space Invaders agent to see how it performs. Unlike the past few bots we've used, you will write this script from scratch.

Create a new script file:

nano bot_6_dqn.py

Begin this script by adding a header comment, importing the necessary utilities, and beginning the main game loop:

/AtariBot/bot_6_dqn.py

"""
Bot 6 - Fully featured deep q-learning network.
"""
import cv2
import gym
import numpy as np
import random
import tensorflow as tf
from bot_6_a3c import a3c_model

def main():
    pass  # placeholder, filled in over the course of this step
    
if __name__ == '__main__':
    main()

Directly after your imports, set random seeds to make your results reproducible. Also, define a hyperparameter num_episodes, which will tell the script how many episodes to run the agent for:

/AtariBot/bot_6_dqn.py

. . .
import tensorflow as tf
from bot_6_a3c import a3c_model

random.seed(0) # make results reproducible
tf.set_random_seed(0)
num_episodes = 10  # new

def main():
. . .

Two lines after declaring num_episodes, define a downsample function that resizes every image to 84 x 84. You will downsample all images before passing them into the pretrained neural network, as the pretrained model was trained on 84 x 84 images:

/AtariBot/bot_6_dqn.py

. . .
num_episodes = 10

def downsample(state):  # new
    return cv2.resize(state, (84, 84), interpolation=cv2.INTER_LINEAR)[None]

def main():
. . .

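As a quick sanity check of the shapes involved (this snippet is illustrative and not part of bot_6_dqn.py), a raw SpaceInvaders-v0 frame is a 210 x 160 x 3 RGB array; resizing it and indexing with [None] produces the 1 x 84 x 84 x 3 array that downsample returns:

import cv2
import numpy as np

frame = np.zeros((210, 160, 3), dtype=np.uint8)  # shape of a raw Atari RGB frame
resized = cv2.resize(frame, (84, 84), interpolation=cv2.INTER_LINEAR)
print(resized.shape)        # (84, 84, 3)
print(resized[None].shape)  # (1, 84, 84, 3) -- [None] prepends a leading axis
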
main 函数的开头创建游戏环境,并播种环境以使结果可复现:

/AtariBot/bot_6_dqn.py

. . .
def main():
    env = gym.make('SpaceInvaders-v0')  # create the game (new)
    env.seed(0)  # make results reproducible (new)
. . .

Directly after the environment seed, initialize an empty list to hold the rewards:

/AtariBot/bot_6_dqn.py

. . .
def main():
    env = gym.make('SpaceInvaders-v0') # create the game
    env.seed(0) # make results reproducible
    rewards = []  # new
. . .

Initialize the pretrained model with the pretrained model parameters that you downloaded at the beginning of this step:

/AtariBot/bot_6_dqn.py

. . .
def main():
    env = gym.make('SpaceInvaders-v0') # create the game
    env.seed(0) # make results reproducible
    rewards = []
    model = a3c_model(load='models/SpaceInvaders-v0.tfmodel')  # new
. . .

Next, add some lines telling the script to iterate num_episodes times to compute average performance, and initialize each episode's reward to 0. Additionally, add a line to reset the environment (env.reset()), collecting the new initial state in the process; downsample this initial state with downsample(); and start the game loop using a while loop:

/AtariBot/bot_6_dqn.py

. . .
def main():
    env = gym.make('SpaceInvaders-v0') # create the game
    env.seed(0) # make results reproducible
    rewards = []
    model = a3c_model(load='models/SpaceInvaders-v0.tfmodel')
    for _ in range(num_episodes):  # new
        episode_reward = 0  # new
        states = [downsample(env.reset())]  # new
        while True:  # new
. . .

Instead of accepting one state at a time, the new neural network accepts four states at a time. As a result, you must wait until the list of states contains at least four states before applying the pretrained model. Add the following lines below the line reading while True:. They tell the agent to take a random action if there are fewer than four states, or to concatenate the states and pass them to the pretrained model if there are at least four:

/AtariBot/bot_6_dqn.py

. . .
while True:
    if len(states) < 4:  # new
        action = env.action_space.sample()  # new
    else:  # new
        frames = np.concatenate(states[-4:], axis=3)  # new
        action = np.argmax(model([frames]))  # new
. . .

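For reference (again illustrative rather than part of the script), concatenating four of these downsampled states along axis=3 merges their color channels, giving a single array of shape (1, 84, 84, 12) to feed to the model:

import numpy as np

states = [np.zeros((1, 84, 84, 3)) for _ in range(4)]  # four downsampled frames
frames = np.concatenate(states[-4:], axis=3)
print(frames.shape)  # (1, 84, 84, 12) -- four frames stacked along the channel axis
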
Then take the action and update the relevant data. Append a downsampled version of the observed state, and update the reward for this episode:

/AtariBot/bot_6_dqn.py

. . .
while True:
    ...
        action = np.argmax(model([frames]))  # last line of the `else` block above
    state, reward, done, _ = env.step(action)  # new
    states.append(downsample(state))  # new
    episode_reward += reward  # new
. . .

Next, add the following lines, which check whether the episode is done and, if it is, print the episode's total reward, append it to the list of all results, and break out of the while loop early:

/AtariBot/bot_6_dqn.py

. . .
while True:
    ...
    episode_reward += reward
    if done:  # new
        print('Reward: %d' % episode_reward)  # new
        rewards.append(episode_reward)  # new
        break  # new
. . .

while 循环和 for 循环之外,打印平均奖励。将其放在 main 函数的末尾:

/AtariBot/bot_6_dqn.py

def main():
    ...
                break  # last line of the `if done:` block above
    print('Average reward: %.2f' % (sum(rewards) / len(rewards)))  # new

Check that your file matches the following:

/AtariBot/bot_6_dqn.py

"""
Bot 6 - Fully featured deep q-learning network.
"""
import cv2
import gym
import numpy as np
import random
import tensorflow as tf
from bot_6_a3c import a3c_model

random.seed(0)  # make results reproducible
tf.set_random_seed(0)
num_episodes = 10

def downsample(state):
    return cv2.resize(state, (84, 84), interpolation=cv2.INTER_LINEAR)[None]

def main():
    env = gym.make('SpaceInvaders-v0')  # create the game
    env.seed(0)  # make results reproducible
    rewards = []
    model = a3c_model(load='models/SpaceInvaders-v0.tfmodel')

    for _ in range(num_episodes):
        episode_reward = 0
        states = [downsample(env.reset())]
        while True:
            if len(states) < 4:
                action = env.action_space.sample()
            else:
                frames = np.concatenate(states[-4:], axis=3)
                action = np.argmax(model([frames]))

            state, reward, done, _ = env.step(action)
            states.append(downsample(state))
            episode_reward += reward

            if done:
                print('Reward: %d' % episode_reward)
                rewards.append(episode_reward)
                break
    print('Average reward: %.2f' % (sum(rewards) / len(rewards)))

if __name__ == '__main__':
    main()

Save the file and exit your editor. Then, run the script:

python bot_6_dqn.py

Your output will end with the following:

Output
. . .
Reward: 1230
Reward: 4510
Reward: 1860
Reward: 2555
Reward: 515
Reward: 1830
Reward: 4100
Reward: 4350
Reward: 1705
Reward: 4905
Average reward: 2756.00

Compare this to the result from the first script, in which you ran a random agent for Space Invaders. The average reward in that case was only about 150, meaning this result is roughly twenty times better. However, you ran the agent for only 10 episodes here, as it's fairly slow, and an average over so few episodes is not an entirely reliable metric on its own. Over this run of 10 episodes the average is 2756; over 100 episodes, the average is around 2500. Only with these averages can you comfortably conclude that your agent is indeed performing an order of magnitude better and that you now have an agent that plays Space Invaders reasonably well.

However, recall the issue raised in the previous section regarding sample complexity. As it turns out, this Space Invaders agent takes millions of samples to train. In fact, it required 24 hours on four Titan X GPUs to reach its current level; in other words, it took a significant amount of compute to train adequately. Can you train a similarly high-performing agent with far fewer samples? The previous steps should arm you with enough knowledge to begin exploring this question. Using far simpler models, and keeping the bias-variance tradeoff in mind, it may be possible.

