The least squares method, also known as linear regression, is a form of regression analysis used widely in mathematics and data science. In machine learning, it is often used to find the optimal linear model for two parameters or datasets.
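
To make the idea concrete, here is a brief sketch (not part of the bot script, using made-up toy data) that uses NumPy's least squares solver to recover the slope and intercept of a noisy line:

import numpy as np

# hypothetical toy data: y is roughly 2x + 1 plus noise
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + np.random.normal(0, 0.5, size=50)

# stack a column of ones so the model can also learn an intercept
X = np.column_stack([x, np.ones_like(x)])

# solve the least squares problem: minimize ||Xw - y||^2
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # approximately [2., 1.]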

In Step 4, you built a neural network to compute Q-values. In this step, instead of a neural network, you will use ridge regression, a variant of least squares, to compute this vector of Q-values. The hope is that with a model as uncomplicated as least squares, solving the game will require fewer training episodes.

Start by duplicating the script from Step 3:

cp bot_3_q_table.py bot_5_ls.py

Open the new file:

nano bot_5_ls.py

Again, update the comment at the top of the file to describe what this script will do:

/AtariBot/bot_5_ls.py

"""
Bot 5 -- Build least squares q-learning agent for FrozenLake
"""
. . .

At the top of the import block near the beginning of your file, add two more imports for type checking:

/AtariBot/bot_5_ls.py

. . .
from typing import Tuple  # new
from typing import Callable  # new
from typing import List
import gym
. . .

In your list of hyperparameters, add another hyperparameter, w_lr, to control the learning rate of the second Q-function. Additionally, update the number of episodes to 5000 and the discount factor to 0.85. By changing both the num_episodes and discount_factor hyperparameters to larger values, the agent will be able to deliver stronger performance:

/AtariBot/bot_5_ls.py

. . .
num_episodes = 5000  # updated
discount_factor = 0.85  # updated
learning_rate = 0.9
w_lr = 0.5  # new
report_interval = 500
. . .

Before your print_report function, add the following higher-order function. It returns a lambda (an anonymous function) that abstracts away the model:

/AtariBot/bot_5_ls.py

. . .
report_interval = 500
report = '100-ep Average: %.2f . Best 100-ep Average: %.2f . Average: ' \
         '%.2f (Episode %d)'

def makeQ(model: np.array) -> Callable[[np.array], np.array]:  # new
    """Returns a Q-function, which takes state -> distribution over
    actions"""
    return lambda X: X.dot(model)

def print_report(rewards: List, episode: int):
. . .
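
To see what this abstraction gives you, the hypothetical snippet below (not part of the script, with made-up numbers) shows how a Q-function built this way maps a one-hot state vector to one Q-value per action:

import numpy as np

n_obs, n_actions = 16, 4  # FrozenLake-sized toy example
W = np.random.normal(0.0, 0.1, (n_obs, n_actions))
Q = lambda X: X.dot(W)  # this is what makeQ(W) returns

state = np.identity(n_obs)[0]  # one-hot encoding of state 0
print(Q(state).shape)  # (4,) -- one Q-value per action
print(np.argmax(Q(state)))  # index of the greedy action for this state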

makeQ 之后,添加另一个函数 initialize,它使用正态分布的值初始化模型:

/AtariBot/bot_5_ls.py

. . .
def makeQ(model: np.array) -> Callable[[np.array], np.array]:
    """Returns a Q-function, which takes state -> distribution over
    actions"""
    return lambda X: X.dot(model)

def initialize(shape: Tuple):  # new
    """Initialize model"""
    W = np.random.normal(0.0, 0.1, shape)
    Q = makeQ(W)
    return W, Q

def print_report(rewards: List, episode: int):
. . .

initialize 块之后,添加一个 train 方法,它计算岭回归的闭式解,然后用新模型加权旧模型。它返回模型和抽象的 Q 函数:

/AtariBot/bot_5_ls.py

. . .
def initialize(shape: Tuple):
    ...
    return W, Q

def train(X: np.array, y: np.array, W: np.array) -> Tuple[np.array,
Callable]:  # new
    """Train the model, using solution to ridge regression"""
    I = np.eye(X.shape[1])
    newW = np.linalg.inv(X.T.dot(X) + 10e-4 * I).dot(X.T.dot(y))
    W = w_lr * newW + (1 - w_lr) * W
    Q = makeQ(W)
    return W, Q

def print_report(rewards: List, episode: int):
. . .
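
For reference, the line computing newW is the closed-form ridge regression solution W = (XᵀX + λI)⁻¹Xᵀy with a small regularizer λ = 10e-4. The hypothetical check below (not part of the script, using random made-up data) confirms that explicitly inverting and directly solving give the same answer; np.linalg.solve is the more numerically stable choice if you ever want to swap it in:

import numpy as np

X = np.random.random((100, 16))  # made-up design matrix (stand-in for a batch of states)
y = np.random.random((100, 4))   # made-up Q-value labels
lam = 10e-4                      # same regularizer used in train()

I = np.eye(X.shape[1])
W_inv = np.linalg.inv(X.T.dot(X) + lam * I).dot(X.T.dot(y))
W_solve = np.linalg.solve(X.T.dot(X) + lam * I, X.T.dot(y))
print(np.allclose(W_inv, W_solve))  # True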

train 之后,添加最后一个函数 one_hot,用于对你的状态和动作执行独热编码:

/AtariBot/bot_5_ls.py

. . .
def train(X: np.array, y: np.array, W: np.array) -> Tuple[np.array,
Callable]:
    ...
    return W, Q

def one_hot(i: int, n: int) -> np.array:  # new
    """Implements one-hot encoding by selecting the ith standard basis
    vector"""
    return np.identity(n)[i]

def print_report(rewards: List, episode: int):
. . .
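
As a quick sanity check (hypothetical, not part of the script), one_hot(2, 4) simply selects the third row of a 4×4 identity matrix:

import numpy as np

print(np.identity(4)[2])  # [0. 0. 1. 0.] -- what one_hot(2, 4) returns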

Following this, you will need to modify the training logic. In the previous script you wrote, the Q-table was updated on every iteration. This script, however, will collect samples and labels at every time step and train a new model every 10 steps. Additionally, instead of holding a Q-table or a neural network, it will use a least squares model to predict Q-values.

Go to the main function and replace the definition of the Q-table (Q = np.zeros(...)) with the following:

/AtariBot/bot_5_ls.py

. . .
def main():
    ...
    rewards = []
    n_obs, n_actions = env.observation_space.n, env.action_space.n
    W, Q = initialize((n_obs, n_actions))  # replaces the Q-table definition
    states, labels = [], []  # new
    for episode in range(1, num_episodes + 1):
. . .

Scroll down to the for loop. Directly below its first line, add the following lines, which reset the states and labels lists if too much information is being stored:

/AtariBot/bot_5_ls.py

. . .
def main():
    ...
    for episode in range(1, num_episodes + 1):
        if len(states) >= 10000:  # new
            states, labels = [], []  # new
. . .

Modify the line directly after this, which defines state = env.reset(), so that it becomes the following. This one-hot encodes the state immediately, since all of its usages will require a one-hot vector:

/AtariBot/bot_5_ls.py

. . .
for episode in range(1, num_episodes + 1):
    if len(states) >= 10000:
        states, labels = [], []
    state = one_hot(env.reset(), n_obs)  # changed
. . .

Before the first line of your while main game loop, amend the list of states:

/AtariBot/bot_5_ls.py

. . .
for episode in range(1, num_episodes + 1):
    ...
    episode_reward = 0
    while True:
        states.append(state)  # new
        noise = np.random.random((1, env.action_space.n)) / (episode**2.)
. . .

Update the computation for action to decrease the probability of noise, and modify the Q-function evaluation:

/AtariBot/bot_5_ls.py

. . .
while True:
    states.append(state)
    noise = np.random.random((1, n_actions)) / episode  # changed
    action = np.argmax(Q(state) + noise)  # changed: call the Q-function
    state2, reward, done, _ = env.step(action)
. . .

Add a one-hot version of state2, and amend the Q-function call in your definition of Qtarget as follows:

/AtariBot/bot_5_ls.py

. . .
while True:
    ...
    state2, reward, done, _ = env.step(action)
    state2 = one_hot(state2, n_obs)  # new
    Qtarget = reward + discount_factor * np.max(Q(state2))  # changed: call the Q-function
. . .

Delete the line that updates Q[state,action] = ... and replace it with the following lines. This code takes the output of the current model and updates only the value in that output corresponding to the action that was taken. As a result, the Q-values for the other actions don't incur any loss:

/AtariBot/bot_5_ls.py

. . .
state2 = one_hot(state2, n_obs)
Qtarget = reward + discount_factor * np.max(Q(state2))
label = Q(state)  # new
label[action] = (1 - learning_rate) * label[action] + learning_rate * Qtarget  # new
labels.append(label)  # new
episode_reward += reward
. . .
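
To make the update concrete, here is a hypothetical worked example with made-up numbers (not part of the script). Only the entry for the chosen action moves toward Qtarget; the remaining entries keep the model's own predictions, so they contribute no error during training:

import numpy as np

learning_rate = 0.9
Qtarget = 1.0                            # e.g. the step earned a reward of 1
label = np.array([0.1, 0.4, 0.2, 0.3])   # hypothetical output of Q(state)
action = 1                               # the action that was actually taken

label[action] = (1 - learning_rate) * label[action] + learning_rate * Qtarget
print(label)  # approximately [0.1, 0.94, 0.2, 0.3]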

state = state2 之后,添加模型的周期性更新。这将每 10 个时间步训练你的模型

/AtariBot/bot_5_ls.py

. . .
state = state2
if len(states) % 10 == 0:  # new
    W, Q = train(np.array(states), np.array(labels), W)  # new
if done:
. . .

Ensure that your code matches the following:

/AtariBot/bot_5_ls.py

"""
Bot 5 -- Build least squares q-learning agent for FrozenLake
"""
from typing import Tuple
from typing import Callable
from typing import List
import gym
import numpy as np
import random

random.seed(0)  # make results reproducible
np.random.seed(0)  # make results reproducible

num_episodes = 5000  # updated
discount_factor = 0.85  # updated
learning_rate = 0.9
w_lr = 0.5  # new
report_interval = 500
report = '100-ep Average: %.2f . Best 100-ep Average: %.2f . Average: ' \
         '%.2f (Episode %d)'

def makeQ(model: np.array) -> Callable[[np.array], np.array]:  # new
    """Returns a Q-function, which takes state -> distribution over
    actions"""
    return lambda X: X.dot(model)

def initialize(shape: Tuple):  # new
    """Initialize model"""
    W = np.random.normal(0.0, 0.1, shape)
    Q = makeQ(W)
    return W, Q

def train(X: np.array, y: np.array, W: np.array) -> Tuple[np.array,
Callable]:  # new
    """Train the model, using solution to ridge regression"""
    I = np.eye(X.shape[1])
    newW = np.linalg.inv(X.T.dot(X) + 10e-4 * I).dot(X.T.dot(y))
    W = w_lr * newW + (1 - w_lr) * W
    Q = makeQ(W)
    return W, Q

def one_hot(i: int, n: int) -> np.array:  # new
    """Implements one-hot encoding by selecting the ith standard basis
    vector"""
    return np.identity(n)[i]

def print_report(rewards: List, episode: int):
    """Print rewards report for current episode
    - Average for last 100 episodes
    - Best 100-episode average across all time
    - Average for all episodes across time
    """
    print(report % (
        np.mean(rewards[-100:]),
        max([np.mean(rewards[i:i+100]) for i in range(len(rewards) - 100)]),
        np.mean(rewards),
        episode))

def main():
    env = gym.make('FrozenLake-v0')  # create the game
    env.seed(0)  # make results reproducible
    rewards = []
    n_obs, n_actions = env.observation_space.n, env.action_space.n
    W, Q = initialize((n_obs, n_actions))  # changed
    states, labels = [], []  # new

    for episode in range(1, num_episodes + 1):
        if len(states) >= 10000:  # new
            states, labels = [], []  # new
        state = one_hot(env.reset(), n_obs)  # changed
        episode_reward = 0
        while True:
            states.append(state)  # new
            noise = np.random.random((1, n_actions)) / episode  # changed
            action = np.argmax(Q(state) + noise)  # changed

            state2, reward, done, _ = env.step(action)
            state2 = one_hot(state2, n_obs)  # new
            Qtarget = reward + discount_factor * np.max(Q(state2))  # changed

            label = Q(state)  # new
            label[action] = (1 - learning_rate) * label[action] + \
                            learning_rate * Qtarget  # new
            labels.append(label)  # new

            episode_reward += reward
            state = state2

            if len(states) % 10 == 0:  # new
                W, Q = train(np.array(states), np.array(labels), W)  # new

            if done:
                rewards.append(episode_reward)
                if episode % report_interval == 0:
                    print_report(rewards, episode)
                break
    print_report(rewards, -1)

if __name__ == '__main__':
    main()

Then, save the file, exit the editor, and run the script:

python bot_5_ls.py

This will output the following:

Output
100-ep Average: 0.17 . Best 100-ep Average: 0.17 . Average: 0.09 (Episode 500)
100-ep Average: 0.11 . Best 100-ep Average: 0.24 . Average: 0.10 (Episode 1000)
100-ep Average: 0.08 . Best 100-ep Average: 0.24 . Average: 0.10 (Episode 1500)
100-ep Average: 0.24 . Best 100-ep Average: 0.25 . Average: 0.11 (Episode 2000)
100-ep Average: 0.32 . Best 100-ep Average: 0.31 . Average: 0.14 (Episode 2500)
100-ep Average: 0.35 . Best 100-ep Average: 0.38 . Average: 0.16 (Episode 3000)
100-ep Average: 0.59 . Best 100-ep Average: 0.62 . Average: 0.22 (Episode 3500)
100-ep Average: 0.66 . Best 100-ep Average: 0.66 . Average: 0.26 (Episode 4000)
100-ep Average: 0.60 . Best 100-ep Average: 0.72 . Average: 0.30 (Episode 4500)
100-ep Average: 0.75 . Best 100-ep Average: 0.82 . Average: 0.34 (Episode 5000)
100-ep Average: 0.75 . Best 100-ep Average: 0.82 . Average: 0.34 (Episode -1)

Recall that, according to the Gym FrozenLake page, "solving" the game means attaining a 100-episode average of 0.78. Here the agent achieves a best 100-episode average of 0.82, meaning it was able to solve the game within 5,000 episodes. Although this does not solve the game in fewer episodes, this basic least squares method is still able to solve a simple game with roughly the same number of training episodes. Even though your neural networks may grow in complexity, you have shown that simple models are sufficient for FrozenLake.

With that, you have explored three Q-learning agents: one using a Q-table, another using a neural network, and a third using least squares. Next, you will build a deep reinforcement learning agent for a more complex game: Space Invaders.

