Step 3 — Creating a Simple Q-learning Agent for Frozen Lake
Now that you have a baseline agent, you can begin creating new agents and comparing them against the original. In this step, you will create an agent that uses Q-learning, a reinforcement learning technique for teaching an agent which action to take in a given state. This agent will play a new game, FrozenLake. The setup for this game is described as follows on the Gym website:
Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend.
The ice surface is described using a grid like the following:
SFFF
FHFH
FFFH
HFFG
(S: starting point, safe)
(F: frozen surface, safe)
(H: hole, fall to your doom)
(G: goal, where the frisbee is located)
The player starts at the top left, denoted by S, and works their way to the goal at the bottom right, denoted by G. The available actions are right, left, up, and down, and reaching the goal results in a score of 1. There are a number of holes, denoted by H, and falling into one of them immediately results in a score of 0.
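If you'd like to confirm the size of this environment before writing any learning code, a quick standalone check like the following (not part of the bot scripts, but using the same FrozenLake-v0 environment you will create later in this step) prints the number of states and actions Gym exposes:
import gym
env = gym.make('FrozenLake-v0')
print(env.observation_space.n)  # 16 discrete states, one per grid cell
print(env.action_space.n)       # 4 discrete actions: left, down, right, up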
In this section, you will implement a simple Q-learning agent. Drawing on what you've learned previously, you will create an agent that trades off between exploration and exploitation. In this context, exploration means the agent acts randomly, and exploitation means it uses its Q-values to choose what it believes to be the optimal action. You will also create a table to hold the Q-values, updating it incrementally as the agent acts and learns.
Make a copy of your script from Step 2:
cp bot_2_random.py bot_3_q_table.py
Then open this new file for editing:
nano bot_3_q_table.py
Begin by updating the comment block at the top of the file that describes the script's purpose. Because this is only a comment, the change isn't necessary for the script to function properly, but it is helpful for keeping track of what the script does:
/AtariBot/bot_3_q_table.py
"""
Bot 3 -- Build simple q-learning agent for FrozenLake
"""
. . .
Before you make functional modifications to the script, you will need to import numpy for its linear algebra utilities. Right underneath import gym, add the highlighted line:
/AtariBot/bot_3_q_table.py
"""
Bot 3 -- Build simple q-learning agent for FrozenLake
"""
import gym
import numpy as np
import random
random.seed(0) # make results reproducible
. . .
Underneath random.seed(0), add a seed for numpy:
/AtariBot/bot_3_q_table.py
. . .
import random
random.seed(0) # make results reproducible
np.random.seed(0)
. . .
Next, make the game states accessible. Update the env.reset() line to the following, which stores the initial state of the game in the variable state:
/AtariBot/bot_3_q_table.py
. . .
for _ in range(num_episodes):
state = env.reset()
. . .
Update the env.step(...) line to the following, which stores the next state, state2. You will need both the current state and the next state, state2, to update the Q-function.
/AtariBot/bot_3_q_table.py
. . .
while True:
action = env.action_space.sample()
state2, reward, done, _ = env.step(action)
. . .
After episode_reward += reward, add a line that updates the variable state. This keeps the state variable up to date for the next iteration, since you will expect state to reflect the current state:
/AtariBot/bot_3_q_table.py
. . .
while True:
. . .
episode_reward += reward
state = state2
if done:
. . .
In the if done block, delete the print statement that prints the reward for each episode. Instead, you'll output the average reward over many episodes. The if done block will then look like this:
/AtariBot/bot_3_q_table.py
. . .
if done:
rewards.append(episode_reward)
break
. . .
After these modifications, your game loop will match the following:
/AtariBot/bot_3_q_table.py
. . .
for _ in range(num_episodes):
state = env.reset()
episode_reward = 0
while True:
action = env.action_space.sample()
state2, reward, done, _ = env.step(action)
episode_reward += reward
state = state2
if done:
rewards.append(episode_reward)
break
. . .
Next, add the ability for the agent to trade off between exploration and exploitation. Right before your main game loop (which starts with for...), create the Q-value table:
/AtariBot/bot_3_q_table.py
. . .
Q = np.zeros((env.observation_space.n, env.action_space.n))
for _ in range(num_episodes):
. . .
Then, rewrite the for loop to expose the episode number:
/AtariBot/bot_3_q_table.py
. . .
Q = np.zeros((env.observation_space.n, env.action_space.n))
for episode in range(1, num_episodes + 1):
. . .
Inside the while True: inner game loop, create noise. Noise, or meaningless random data, is sometimes introduced when training deep neural networks because it can improve both the performance and the accuracy of the model. Note that the higher the noise, the less the values in Q[state, :] matter. As a result, the higher the noise, the more likely the agent is to act independently of its knowledge of the game. In other words, higher noise encourages the agent to explore random actions:
/AtariBot/bot_3_q_table.py
. . .
while True:
noise = np.random.random((1, env.action_space.n)) / \
(episode**2.)
action = env.action_space.sample()
. . .
Note that as the episode number increases, the amount of noise decreases quadratically: as time goes on, the agent explores less and less, because it can trust its own assessment of the game's rewards and begin to exploit its knowledge.
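To get a feel for how fast that happens, here is a small standalone calculation (illustrative only, reusing the same expression as above) of the largest noise value added to any Q-value at a few episode numbers:
import numpy as np
np.random.seed(0)
n_actions = 4  # FrozenLake's four actions
for episode in (1, 10, 100):
    noise = np.random.random((1, n_actions)) / (episode**2.)
    # The noise is at most 1.0 at episode 1, about 0.01 by episode 10,
    # and about 0.0001 by episode 100, so the learned Q-values soon dominate.
    print(episode, noise.max())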
Update the action line to have your agent pick actions according to the Q-value table, with some exploration built in:
/AtariBot/bot_3_q_table.py
. . .
noise = np.random.random((1, env.action_space.n)) / \
(episode**2.)
action = np.argmax(Q[state, :] + noise)
state2, reward, done, _ = env.step(action)
. . .
Your main game loop will then match the following:
/AtariBot/bot_3_q_table.py
. . .
Q = np.zeros((env.observation_space.n, env.action_space.n))
for episode in range(1, num_episodes + 1):
state = env.reset()
episode_reward = 0
while True:
noise = np.random.random((1, env.action_space.n)) / \
(episode**2.)
action = np.argmax(Q[state, :] + noise)
state2, reward, done, _ = env.step(action)
episode_reward += reward
state = state2
if done:
rewards.append(episode_reward)
break
. . .
Next, you will update your Q-value table using the Bellman update equation, an equation widely used in machine learning to find the optimal policy within a given environment.
The Bellman equation incorporates two ideas that are highly relevant to this project. First, taking a particular action from a particular state many times will result in a good estimate for the Q-value associated with that state and action. To this end, you will increase the number of episodes this bot must play through in order to return a stronger Q-value estimate. Second, rewards must propagate through time, so that the original action is assigned a non-zero reward. This idea is clearest in games with delayed rewards; for example, in Space Invaders, the player is rewarded when an alien is blown up, not when the player shoots. However, the player shooting is the true impetus for the reward. Likewise, the Q-function must assign (state0, shoot) a positive reward.
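Concretely, the update you are about to implement computes a target value, reward + discount_factor * max(Q[state2, :]), and then moves Q[state, action] part of the way toward that target. The following standalone sketch (using the same hyperparameter values you will add below) walks through a single update by hand:
import numpy as np

discount_factor = 0.8  # weight given to the best Q-value of the next state
learning_rate = 0.9    # how far each update moves toward the new target

Q = np.zeros((16, 4))                 # 16 states x 4 actions, all estimates start at 0
state, action, reward = 14, 2, 1.0    # suppose moving right from state 14 reached the goal (state 15)
Qtarget = reward + discount_factor * np.max(Q[15, :])  # 1.0 + 0.8 * 0 = 1.0
Q[state, action] = (1 - learning_rate) * Q[state, action] + learning_rate * Qtarget
print(Q[state, action])               # 0.9 -- most of the reward is written into the table at once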
First, update num_episodes to equal 4000:
/AtariBot/bot_3_q_table.py
. . .
np.random.seed(0)
num_episodes = 4000
. . .
Then, add the necessary hyperparameters to the top of the file in the form of two more variables:
/AtariBot/bot_3_q_table.py
. . .
num_episodes = 4000
discount_factor = 0.8
learning_rate = 0.9
. . .
Right after the line containing env.step(...), compute the new target Q-value, Qtarget:
/AtariBot/bot_3_q_table.py
. . .
state2, reward, done, _ = env.step(action)
Qtarget = reward + discount_factor * np.max(Q[state2, :])
episode_reward += reward
. . .
On the line directly below Qtarget, update the Q-value table using a weighted average of the old and new Q-values:
/AtariBot/bot_3_q_table.py
. . .
Qtarget = reward + discount_factor * np.max(Q[state2, :])
Q[state, action] = (
1 - learning_rate
) * Q[state, action] + learning_rate * Qtarget
episode_reward += reward
. . .
Check that your main game loop now matches the following:
/AtariBot/bot_3_q_table.py
. . .
Q = np.zeros((env.observation_space.n, env.action_space.n))
for episode in range(1, num_episodes + 1):
state = env.reset()
episode_reward = 0
while True:
noise = np.random.random((1, env.action_space.n)) / \
(episode**2.)
action = np.argmax(Q[state, :] + noise)
state2, reward, done, _ = env.step(action)
Qtarget = reward + discount_factor * np.max(Q[state2, :])
Q[state, action] = (
1 - learning_rate
) * Q[state, action] + learning_rate * Qtarget
episode_reward += reward
state = state2
if done:
rewards.append(episode_reward)
break
. . .
The logic for training the agent is now complete. All that's left is to add reporting mechanisms.
Even though Python does not enforce strict type checking, add types to your function declarations to keep the code clean. At the top of the file, before the first line reading import gym, import the List type:
/AtariBot/bot_3_q_table.py
. . .
from typing import List
import gym
. . .
Right after learning_rate = 0.9, outside of the main function, declare the interval and format for the reports:
/AtariBot/bot_3_q_table.py
. . .
learning_rate = 0.9
report_interval = 500
report = '100-ep Average: %.2f . Best 100-ep Average: %.2f . Average: %.2f ' \
'(Episode %d)'
def main():
. . .
Before the main function, add a new function that will populate this report string using the list of all rewards:
/AtariBot/bot_3_q_table.py
. . .
report = '100-ep Average: %.2f . Best 100-ep Average: %.2f . Average: %.2f ' \
'(Episode %d)'
def print_report(rewards: List, episode: int):
"""Print rewards report for current episode
- Average for last 100 episodes
- Best 100-episode average across all time
- Average for all episodes across time
"""
print(report % (
np.mean(rewards[-100:]),
max([np.mean(rewards[i:i+100]) for i in range(len(rewards) - 100)]),
np.mean(rewards),
episode))
def main():
. . .
Change the game to FrozenLake instead of SpaceInvaders:
/AtariBot/bot_3_q_table.py
. . .
def main():
env = gym.make('FrozenLake-v0') # create the game
. . .
After rewards.append(...), print the average reward over the last 100 episodes and the average reward across all episodes:
/AtariBot/bot_3_q_table.py
. . .
if done:
rewards.append(episode_reward)
if episode % report_interval == 0:
print_report(rewards, episode)
. . .
At the end of the main() function, report both averages once more. Do this by replacing the line that reads print('Average reward: %.2f' % (sum(rewards) / len(rewards))) with the following highlighted line:
/AtariBot/bot_3_q_table.py
. . .
def main():
...
break
print_report(rewards, -1)
. . .
Finally, you have completed your Q-learning agent. Check that your script aligns with the following:
/AtariBot/bot_3_q_table.py
"""
Bot 3 -- Build simple q-learning agent for FrozenLake
"""
from typing import List
import gym
import numpy as np
import random
random.seed(0) # make results reproducible
np.random.seed(0) # make results reproducible
num_episodes = 4000
discount_factor = 0.8
learning_rate = 0.9
report_interval = 500
report = '100-ep Average: %.2f . Best 100-ep Average: %.2f . Average: %.2f ' \
'(Episode %d)'
def print_report(rewards: List, episode: int):
"""Print rewards report for current episode
- Average for last 100 episodes
- Best 100-episode average across all time
- Average for all episodes across time
"""
print(report % (
np.mean(rewards[-100:]),
max([np.mean(rewards[i:i+100]) for i in range(len(rewards) - 100)]),
np.mean(rewards),
episode))
def main():
env = gym.make('FrozenLake-v0') # create the game
env.seed(0) # make results reproducible
rewards = []
Q = np.zeros((env.observation_space.n, env.action_space.n))
for episode in range(1, num_episodes + 1):
state = env.reset()
episode_reward = 0
while True:
noise = np.random.random((1, env.action_space.n)) / \
(episode**2.)
action = np.argmax(Q[state, :] + noise)
state2, reward, done, _ = env.step(action)
Qtarget = reward + discount_factor * np.max(Q[state2, :])
Q[state, action] = (
1 - learning_rate
) * Q[state, action] + learning_rate * Qtarget
episode_reward += reward
state = state2
if done:
rewards.append(episode_reward)
if episode % report_interval == 0:
print_report(rewards, episode)
break
print_report(rewards, -1)
if __name__ == '__main__':
main()
Save the file, exit your editor, and run the script:
python bot_3_q_table.py
Your output will match the following:
Output
100-ep Average: 0.11 . Best 100-ep Average: 0.12 . Average: 0.03 (Episode 500)
100-ep Average: 0.25 . Best 100-ep Average: 0.24 . Average: 0.09 (Episode 1000)
100-ep Average: 0.39 . Best 100-ep Average: 0.48 . Average: 0.19 (Episode 1500)
100-ep Average: 0.43 . Best 100-ep Average: 0.55 . Average: 0.25 (Episode 2000)
100-ep Average: 0.44 . Best 100-ep Average: 0.55 . Average: 0.29 (Episode 2500)
100-ep Average: 0.64 . Best 100-ep Average: 0.68 . Average: 0.32 (Episode 3000)
100-ep Average: 0.63 . Best 100-ep Average: 0.71 . Average: 0.36 (Episode 3500)
100-ep Average: 0.56 . Best 100-ep Average: 0.78 . Average: 0.40 (Episode 4000)
100-ep Average: 0.56 . Best 100-ep Average: 0.78 . Average: 0.40 (Episode -1)
You now have your first non-trivial bot for games, but let's put this average reward of 0.78 into perspective. According to the Gym FrozenLake page, "solving" the game means attaining a 100-episode average of 0.78. Informally, "solving" means "plays the game very well." While not in record time, the Q-table agent is able to solve FrozenLake in 4,000 episodes.
However, games can be far more complex. Here, you used a table to store all 16 possible states, but consider tic-tac-toe, which has 19,683 possible states. Likewise, consider Space Invaders, where there are too many possible states to count. A Q-table is not sustainable as games grow increasingly complex. For this reason, you need some way to approximate the Q-table. As you continue experimenting in the next step, you will design a function that can accept states and actions as inputs and output a Q-value.
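As a rough preview of that idea, the hypothetical sketch below (not part of this tutorial's scripts) replaces the table row lookup Q[state, :] with a function of a one-hot encoded state; the next step builds on this general shape:
import numpy as np

n_states, n_actions = 16, 4
W = np.zeros((n_states, n_actions))   # parameters of a simple linear Q-function

def one_hot(state: int) -> np.ndarray:
    """Encode a discrete state as a vector so it can be fed to a parametric model."""
    encoded = np.zeros(n_states)
    encoded[state] = 1.0
    return encoded

def q_values(state: int) -> np.ndarray:
    """Approximate Q(state, :) as a function of the state rather than a stored table row."""
    return one_hot(state).dot(W)

action = int(np.argmax(q_values(0)))  # the same argmax action selection as before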
You have successfully created a Q-learning agent and seen how it performs on FrozenLake. You are now ready to explore how to handle more complex games by using a Q-function instead of a Q-table.