Machine Learning with Python
Performance Improvement with Ensembles
Ensembles can boost machine learning results by combining several models. Basically, an ensemble model consists of several individually trained supervised learning models whose results are merged in various ways to achieve better predictive performance than any single model. Ensemble methods can be divided into the following two groups:
- Sequential ensemble methods: As the name implies, in these ensemble methods the base learners are generated sequentially. The motivation of such methods is to exploit the dependency among the base learners.
- Parallel ensemble methods: As the name implies, in these ensemble methods the base learners are generated in parallel. The motivation of such methods is to exploit the independence among the base learners.
Ensemble Learning Methods
The following are the most popular ensemble learning methods, i.e. the methods for combining the predictions from different models:
Bagging
The term "bagging" is short for "bootstrap aggregation". In bagging methods, the ensemble model improves prediction accuracy and decreases model variance by combining the predictions of individual models trained on randomly drawn training samples. The final prediction of the ensemble model is given by averaging all predictions from the individual estimators. One of the best examples of bagging methods is the random forest.
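To make the mechanics concrete, here is a minimal, hand-rolled sketch of bagging; the toy data and the choice of 10 trees are illustrative assumptions, not part of the recipes below. Each tree is trained on a bootstrap sample (drawn with replacement) and the individual predictions are averaged:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))        # hypothetical toy features
y_toy = (X_toy[:, 0] > 0).astype(int)    # hypothetical toy labels

predictions = []
for _ in range(10):
    # draw a bootstrap sample: same size as the data, sampled with replacement
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    tree = DecisionTreeClassifier().fit(X_toy[idx], y_toy[idx])
    predictions.append(tree.predict(X_toy))

# aggregate: average the individual votes and threshold at 0.5
bagged_pred = (np.mean(predictions, axis=0) > 0.5).astype(int)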
Boosting
In boosting methods, the main principle of building the ensemble model is to build it incrementally by training each base model estimator sequentially. As the name suggests, it combines several weak base learners, trained sequentially over multiple iterations of the training data, to build a powerful ensemble. During training, higher weights are assigned to the instances that earlier learners misclassified, so later learners focus on the hard cases. An example of a boosting method is AdaBoost.
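The reweighting step is the heart of the method. Below is a simplified, self-contained sketch of an AdaBoost-style weight update; the five-instance error pattern is an invented toy example:
import numpy as np

weights = np.full(5, 1 / 5)                                  # start uniform
misclassified = np.array([False, True, False, True, False])  # toy error pattern

error = weights[misclassified].sum()        # weighted error of the weak learner
alpha = 0.5 * np.log((1 - error) / error)   # the learner's vote weight

# misclassified instances are up-weighted for the next learner
weights *= np.exp(alpha * np.where(misclassified, 1.0, -1.0))
weights /= weights.sum()                    # renormalize to a distribution
print(weights)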
Voting
In this ensemble learning model, multiple models of different types are built, and simple statistics, such as the mean or the median of the individual predictions, are used to combine them. The combined prediction is the ensemble's final prediction.
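For classification, the most common combination rule is hard voting: each sub-model casts one vote and the majority class wins. A minimal sketch for a binary problem, with an invented prediction matrix:
import numpy as np

# hypothetical 0/1 class predictions from three models on four samples
preds = np.array([[0, 1, 1, 0],    # model 1
                  [0, 1, 0, 0],    # model 2
                  [1, 1, 1, 0]])   # model 3

# hard voting: a sample is classified 1 if more than half the models say 1
majority = (preds.sum(axis=0) > preds.shape[0] / 2).astype(int)
print(majority)   # [0 1 1 0]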
Bagging Ensemble Algorithms
The following are three bagging ensemble algorithms:
Bagged Decision Tree
As we know, bagging ensemble methods work well with algorithms that have high variance, and in this regard the decision tree algorithm is the best choice. In the following Python recipe, we are going to build a bagged decision tree ensemble model by using the BaggingClassifier function of sklearn with DecisionTreeClassifier (a classification and regression tree algorithm) on the Pima Indians diabetes dataset.
First, import the required packages as follows:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
Now, we need to load the Pima diabetes dataset as we did in the previous examples:
# assuming the 'pima-indians-diabetes.csv' file is saved at a local path
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age',
'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
Next, give the input for 10-fold cross-validation as follows:
seed = 7
kfold = KFold(n_splits=10, random_state=seed, shuffle=True) # shuffle so random_state takes effect and folds are reproducible
cart = DecisionTreeClassifier()
We need to provide the number of trees we are going to build. Here we are building 150 trees:
num_trees = 150
Next, build the model with the help of the following script:
# scikit-learn renamed this parameter from 'base_estimator' to 'estimator'
# (deprecated in 1.2, removed in 1.4); on older versions use base_estimator=cart
model = BaggingClassifier(estimator=cart, n_estimators=num_trees,
                          random_state=seed)
Calculate and print the result as follows:
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Output:
0.7733766233766234
The output above shows that our bagged decision tree classifier model achieved an accuracy of about 77%.
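Cross-validation only scores the model; to actually classify new patients you would refit on the full data and call predict. A brief usage sketch, where the sample row is purely illustrative:
# refit the ensemble on all of the data, then classify one illustrative sample
model.fit(X, Y)
new_sample = [[6, 148, 72, 35, 0, 33.6, 0.627, 50]]  # the 8 input features
print(model.predict(new_sample))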
Random Forest
It is an extension of bagged decision trees. For the individual classifiers, samples of the training dataset are taken with replacement, but the trees are constructed in a way that reduces the correlation between them: when building each tree, a random subset of the features is considered at each split point, rather than greedily choosing the best split over all features.
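The per-split feature subsampling is easy to picture in isolation. A hypothetical sketch of what happens at a single split, using the 8 features and max_features=5 from the recipe below:
import numpy as np

rng = np.random.default_rng(0)
n_features, max_features = 8, 5
# at each split, only a random subset of the features is even considered
candidate_features = rng.choice(n_features, size=max_features, replace=False)
print(candidate_features)   # indices of the 5 features examined at this split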
In the following Python recipe, we are going to build a bagged random forest ensemble model by using the RandomForestClassifier class of sklearn on the Pima Indians diabetes dataset.
First, import the required packages as follows:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
Now, we need to load the Pima diabetes dataset as we did in the previous examples:
# assuming the 'pima-indians-diabetes.csv' file is saved at a local path
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age',
'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
Next, give the input for 10-fold cross-validation as follows:
seed = 7
kfold = KFold(n_splits=10, random_state=seed, shuffle=True) # shuffle so random_state takes effect
We need to provide the number of trees we are going to build. Here we are building 150 trees with split points chosen from 5 features:
num_trees = 150
max_features = 5
Next, build the model with the help of the following script:
model = RandomForestClassifier(n_estimators=num_trees,
                               max_features=max_features, random_state=seed) # fixed seed for reproducible results
Calculate and print the result as follows:
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Output:
0.7629357484620642
The output above shows that our bagged random forest classifier model achieved an accuracy of about 76%.
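A fitted forest also reports how much each feature contributed to its splits, which is often as useful as the accuracy number. A short follow-up sketch using the variables already defined above:
# refit on the full data to inspect impurity-based feature importances
model.fit(X, Y)
for name, importance in zip(headernames[:8], model.feature_importances_):
    print(name, round(importance, 3))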
Extra Trees
It is another extension of the bagged decision tree ensemble method. In this method, randomized trees are constructed from the training dataset: rather than searching for the best cut, split thresholds are drawn at random, which randomizes the trees even further than a random forest does.
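To illustrate just the threshold step, here is a hypothetical sketch of how an extra tree might pick a cut point for one feature: uniformly at random within the feature's observed range.
import numpy as np

rng = np.random.default_rng(0)
feature_values = rng.normal(size=20)   # hypothetical values of one feature

# draw the split threshold at random within the feature's range,
# instead of evaluating every candidate cut point
threshold = rng.uniform(feature_values.min(), feature_values.max())
print(threshold)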
In the following Python recipe, we are going to build an extra trees ensemble model by using the ExtraTreesClassifier class of sklearn on the Pima Indians diabetes dataset.
First, import the required packages as follows:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import ExtraTreesClassifier
Now, we need to load the Pima diabetes dataset as we did in the previous examples:
# assuming the 'pima-indians-diabetes.csv' file is saved at a local path
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age',
'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
Next, give the input for 10-fold cross-validation as follows:
seed = 7
kfold = KFold(n_splits=10, random_state=seed, shuffle=True) # shuffle so random_state takes effect
We need to provide the number of trees we are going to build. Here we are building 150 trees with split points chosen from 5 features:
num_trees = 150
max_features = 5
Next, build the model with the help of the following script:
model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features, random_state=seed) # fixed seed for reproducible results
Calculate and print the result as follows:
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Output:
0.7551435406698566
The output above shows that our extra trees classifier model achieved an accuracy of about 75.5%.
Boosting Ensemble Algorithms
The following are the two most common boosting ensemble algorithms:
AdaBoost
It is one of the most successful boosting ensemble algorithms. The key to this algorithm is the way it weights the instances in the dataset: instances that a model misclassifies are given higher weights, so the subsequent models concentrate on the cases the ensemble is still getting wrong.
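The recipe below uses AdaBoostClassifier with its defaults, where the weak learner is a depth-1 decision tree (a "decision stump"). If you want to make that choice explicit, or swap in a different weak learner, here is a hedged sketch; the parameter name 'estimator' assumes scikit-learn 1.2 or newer, while older versions call it base_estimator:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# a decision stump (depth-1 tree) is the classic AdaBoost weak learner
stump = DecisionTreeClassifier(max_depth=1)
boosted = AdaBoostClassifier(estimator=stump, n_estimators=50, random_state=5)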
In the following Python recipe, we are going to build an AdaBoost ensemble model for classification by using the AdaBoostClassifier class of sklearn on the Pima Indians diabetes dataset.
First, import the required packages as follows:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier
Now, we need to load the Pima diabetes dataset as we did in the previous examples:
# assuming the 'pima-indians-diabetes.csv' file is saved at a local path
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age',
'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
Next, give the input for 10-fold cross-validation as follows:
seed = 5
kfold = KFold(n_splits=10, random_state=seed, shuffle=True) # shuffle so random_state takes effect
We need to provide the number of trees we are going to build. Here we are building 50 trees:
num_trees = 50
Next, build the model with the help of the following script:
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
Calculate and print the result as follows:
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Output:
0.7539473684210527
The output above shows that our AdaBoost classifier ensemble model achieved an accuracy of about 75%.
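Because boosting is sequential, you can watch the ensemble improve round by round. A brief sketch using staged_predict on a held-out split (the split itself is an illustrative addition to the recipe):
import numpy as np
from sklearn.model_selection import train_test_split

# hold out a test split and track accuracy as boosting rounds accumulate
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=5)
model.fit(X_train, Y_train)
for i, stage_pred in enumerate(model.staged_predict(X_test), start=1):
    if i % 10 == 0:
        print(i, np.mean(stage_pred == Y_test))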
Stochastic Gradient Boosting
It is also called Gradient Boosting Machines. In the following Python recipe, we are going to build a stochastic gradient boosting ensemble model for classification by using the GradientBoostingClassifier class of sklearn on the Pima Indians diabetes dataset.
First, import the required packages as follows:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
Now, we need to load the Pima diabetes dataset as we did in the previous examples:
# assuming the 'pima-indians-diabetes.csv' file is saved at a local path
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age',
'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
Next, give the input for 10-fold cross-validation as follows:
seed = 5
kfold = KFold(n_splits=10, random_state=seed, shuffle=True) # shuffle so random_state takes effect
We need to provide the number of trees we are going to build. Here we are building 50 trees:
num_trees = 50
Next, build the model with the help of the following script:
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
Calculate and print the result as follows:
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Output:
0.7746582365003418
The output above shows that our gradient boosting classifier ensemble model achieved an accuracy of about 77.5%.
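Note that the recipe above is plain gradient boosting: the "stochastic" part of the name refers to fitting each tree on a random fraction of the training rows, which GradientBoostingClassifier exposes through its subsample parameter. A sketch, where the 0.8 fraction is an illustrative choice:
# subsample < 1.0 makes the boosting stochastic: each tree is fit on a
# random 80% of the rows, which often reduces variance
model = GradientBoostingClassifier(n_estimators=num_trees, subsample=0.8,
                                   random_state=seed)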
Voting Ensemble Algorithms
As discussed, voting first creates two or more standalone models from the training dataset; a voting classifier then wraps those models and combines their predictions, by averaging or majority vote, whenever new data needs to be classified.
In the following Python recipe, we are going to build a voting ensemble model for classification by using the VotingClassifier class of sklearn on the Pima Indians diabetes dataset. We are combining the predictions of logistic regression, a decision tree classifier, and an SVM for a classification problem as follows:
First, import the required packages as follows:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
Now, we need to load the Pima diabetes dataset as we did in the previous examples:
# assuming the 'pima-indians-diabetes.csv' file is saved at a local path
path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age',
'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]
Next, give the input for 10-fold cross-validation as follows:
kfold = KFold(n_splits=10, random_state=7, shuffle=True) # shuffle so random_state takes effect
Next, we need to create the sub-models as follows:
estimators = []
model1 = LogisticRegression(solver='liblinear') # liblinear converges reliably on this small dataset
estimators.append(('logistic', model1))
model2 = DecisionTreeClassifier()
estimators.append(('cart', model2))
model3 = SVC(probability=True) # enables predict_proba, needed only if you switch to soft voting
estimators.append(('svm', model3))
Now, create the voting ensemble model by combining the predictions of the sub-models created above.
ensemble = VotingClassifier(estimators)
results = cross_val_score(ensemble, X, Y, cv=kfold)
print(results.mean())
Output:
0.7382262474367738
The output above shows that our voting classifier ensemble model achieved an accuracy of about 74%.
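By default, VotingClassifier uses hard (majority) voting. Since every sub-model above can produce probabilities, you could also try soft voting, which averages the predicted class probabilities; a brief sketch:
# soft voting averages predict_proba outputs instead of counting class votes;
# this is why probability=True was set on the SVC above
soft_ensemble = VotingClassifier(estimators, voting='soft')
results = cross_val_score(soft_ensemble, X, Y, cv=kfold)
print(results.mean())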