Machine Learning with Python
25. Machine Learning – Improving Performance of ML Model (Contd...)
Performance Improvement with Algorithm Tuning
ML models are parameterized so that their behavior can be adjusted for a specific
problem. Algorithm tuning means finding the combination of these parameters that gives
the best model performance. This process is sometimes called hyperparameter
optimization: the settings of the algorithm itself are called hyperparameters, while the
coefficients that the algorithm learns from the data are called parameters.
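To make the distinction concrete, here is a minimal sketch. It assumes a small synthetic dataset (the names X_demo and y_demo and the chosen alpha values are illustrative only, not part of the original example). The value of alpha is chosen before training, while the values in coef_ are learned from the data.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (an assumption for illustration): y depends linearly on 3 features.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# alpha is a hyperparameter: we choose it before training.
weak = Ridge(alpha=0.01).fit(X_demo, y_demo)
strong = Ridge(alpha=1000.0).fit(X_demo, y_demo)

# coef_ holds the parameters: the algorithm learns them from the data.
print("alpha=0.01 ->", weak.coef_)
print("alpha=1000 ->", strong.coef_)
```

With the larger alpha, the learned coefficients are shrunk toward zero, which shows how a hyperparameter changes the parameters that training produces.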
Here, we discuss some of the methods for algorithm parameter tuning provided by
Python's scikit-learn library.
Grid Search Parameter Tuning
Grid search is a parameter tuning approach that methodically builds and evaluates a
model for every possible combination of the algorithm parameters specified in a grid.
In other words, it is an exhaustive search over the grid.
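The idea can be sketched by hand, without GridSearchCV. The sketch below is illustrative only: it assumes synthetic data from make_regression and a hypothetical two-parameter grid, and it scores each combination with 5-fold cross-validation.

```python
from itertools import product

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data (an assumption; the chapter itself uses the Pima dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0,
                                 random_state=0)

# A hypothetical grid: 3 alpha values x 2 intercept settings = 6 combinations.
grid = {"alpha": [1, 0.1, 0.01], "fit_intercept": [True, False]}

results = []
names = list(grid)
for values in product(*(grid[name] for name in names)):
    params = dict(zip(names, values))
    # Build and evaluate one model per combination.
    scores = cross_val_score(Ridge(**params), X_demo, y_demo, cv=5)
    results.append((scores.mean(), params))

# Keep the combination with the highest mean cross-validated score.
best_score, best_params = max(results, key=lambda item: item[0])
print(len(results), "combinations evaluated")
print(best_score, best_params)
```

The number of models grows as the product of the grid sizes, which is why grid search becomes expensive as more parameters are added.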
Example
In the following Python recipe, we use the GridSearchCV class of sklearn to perform a
grid search over several alpha values for the Ridge Regression algorithm on the Pima
Indians diabetes dataset.
First, import the required packages as follows:
import numpy
from pandas import read_csv
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
Now, we need to load the Pima diabetes dataset as we did in previous examples:
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age',
'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
Next, define the alpha values to evaluate:
alphas = numpy.array([1,0.1,0.01,0.001,0.0001,0])
param_grid = dict(alpha=alphas)
Now, we need to apply grid search on our model:
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid.fit(X, Y)
Print the results with the following lines:
print(grid.best_score_)
print(grid.best_estimator_.alpha)
Output:
0.2796175593129722
1.0
The output gives the best mean cross-validated score (R², the default score for Ridge)
and the parameter value in the grid that achieved it. The alpha value in this case is 1.0.
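Beyond the single best result, the fitted search object also keeps the score of every grid point in its cv_results_ attribute. The following sketch shows one way to list them. It assumes synthetic data in place of the Pima CSV file and an explicit cv=5, so its scores will not match the output above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the dataset (an assumption for illustration).
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0,
                                 random_state=0)

alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
search = GridSearchCV(estimator=Ridge(), param_grid={"alpha": alphas}, cv=5)
search.fit(X_demo, y_demo)

# One row per grid point: mean and standard deviation of the fold scores.
for params, mean, std in zip(search.cv_results_["params"],
                             search.cv_results_["mean_test_score"],
                             search.cv_results_["std_test_score"]):
    print(f"alpha={params['alpha']:<7} mean R^2={mean:.4f} (+/- {std:.4f})")

print("best:", search.best_params_, search.best_score_)
```

Comparing the mean scores side by side shows whether the best alpha is clearly better than the others or only marginally so.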
Random Search Parameter Tuning
Random search is a parameter tuning approach that samples the algorithm parameters
from a random distribution for a fixed number of iterations and evaluates a model for
each sample.
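The sampling step can be sketched by hand with scikit-learn's ParameterSampler. This is an illustrative sketch only: the synthetic data, the choice of 10 iterations, and the variable names are assumptions, not part of the original example.

```python
from scipy.stats import uniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import ParameterSampler, cross_val_score

# Synthetic regression data (an assumption; the chapter itself uses the Pima dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0,
                                 random_state=0)

# Draw 10 alpha values from a uniform distribution over [0, 1].
sampler = ParameterSampler({"alpha": uniform()}, n_iter=10, random_state=7)

results = []
for params in sampler:
    # Evaluate one model per sampled parameter setting.
    score = cross_val_score(Ridge(**params), X_demo, y_demo, cv=5).mean()
    results.append((score, params["alpha"]))

best_score, best_alpha = max(results)
print(len(results), "settings sampled")
print(best_score, best_alpha)
```

Unlike grid search, the cost here is fixed by the number of iterations rather than by the size of the parameter space.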
Example
In the following Python recipe, we use the RandomizedSearchCV class of sklearn to
perform a random search over alpha values between 0 and 1 for the Ridge Regression
algorithm on the Pima Indians diabetes dataset.
First, import the required packages as follows:
import numpy
from pandas import read_csv
from scipy.stats import uniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
Now, we need to load the Pima diabetes dataset as we did in previous examples:
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age',
'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
Next, define a distribution for alpha and run the random search on the Ridge regression model:
param_grid = {'alpha': uniform()}  # uniform distribution over [0, 1]
model = Ridge()
random_search = RandomizedSearchCV(estimator=model,
                                   param_distributions=param_grid, n_iter=50,
                                   random_state=7)
random_search.fit(X, Y)
Print the results with the following lines:
print(random_search.best_score_)
print(random_search.best_estimator_.alpha)
Output:
0.27961712703051084
0.9779895119966027
The output gives a best score very close to the one found by grid search, here with a
sampled alpha value of about 0.978.