Machine Learning with Python – Data Feature Selection
In the previous chapter, we saw in detail how to preprocess and prepare data for machine learning. In this chapter, let us look in detail at data feature selection and the various aspects involved in it.
Importance of Data Feature Selection
The performance of a machine learning model depends directly on the data features used to train it. If the features fed to the model are irrelevant, its performance suffers; relevant features, on the other hand, can increase the model's accuracy, especially for linear and logistic regression.
This raises the question: what is automatic feature selection? It can be defined as the process by which we select those features in our data that are most relevant to the output or prediction variable we are interested in. It is also called attribute selection.
The following are some of the benefits of performing automatic feature selection before modeling the data:
- Performing feature selection before data modeling reduces overfitting.
- Performing feature selection before data modeling improves the accuracy of the ML model.
- Performing feature selection before data modeling reduces training time.
Feature Selection Techniques
The following are automatic feature selection techniques that we can use to model ML data in Python:
Univariate Selection
This feature selection technique is very useful for selecting, with the help of statistical tests, the features that have the strongest relationship with the prediction variable. We can implement univariate feature selection with the SelectKBest class of the scikit-learn Python library.
Example:
In this example, we will use the Pima Indians Diabetes dataset to select the 4 best attributes with the help of the chi-squared (chi2) statistical test.
from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values
Next, we will separate the array into its input and output components:
X = array[:,0:8]
Y = array[:,8]
The following lines of code will select the best features from the dataset:
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X,Y)
We can also summarize the output data as we choose. Here, we set the print precision to 2 and display the score of each attribute along with the first few rows of the 4 attributes that have the best features:
set_printoptions(precision=2)
print(fit.scores_)                              # chi-squared score of each of the 8 input attributes
featured_data = fit.transform(X)                # keep only the 4 highest-scoring attributes
print("\nFeatured data:\n", featured_data[0:4])
Output
[ 111.52 1411.89 17.61 53.11 2175.57 127.67 5.39 181.3 ]
Featured data:
[[148. 0. 33.6 50. ]
[ 85. 0. 26.6 31. ]
[183. 0. 23.3 32. ]
[ 89. 94. 28.1 21. ]]
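If you also want to know which columns were kept, the fitted selector exposes a boolean mask through its get_support method. The following is a minimal sketch that reuses the names, X and fit objects defined above; it pairs each score with its column name and lists the selected attributes:
# Hypothetical follow-up to the SelectKBest example above.
feature_names = names[0:8]                       # input columns only, 'class' excluded
for name, score in zip(feature_names, fit.scores_):
    print("%s: %.2f" % (name, score))
mask = fit.get_support()                         # boolean mask of the selected columns
selected = [name for name, keep in zip(feature_names, mask) if keep]
print("Selected attributes:", selected)          # given the scores above: plas, test, mass, age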
Recursive Feature Elimination (RFE)
As the name suggests, the RFE (Recursive Feature Elimination) technique removes attributes recursively and builds the model with the remaining attributes. We can implement RFE feature selection with the RFE class of the scikit-learn Python library.
Example
In this example, we will use RFE with the logistic regression algorithm to select the 3 best attributes from the Pima Indians Diabetes dataset.
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values
Next, we will separate the array into its input and output components:
X = array[:,0:8]
Y = array[:,8]
The following lines of code will select the best features from the dataset:
model = LogisticRegression(solver='liblinear')   # liblinear handles this small dataset without convergence warnings
rfe = RFE(model, n_features_to_select=3)         # keep the 3 strongest attributes
fit = rfe.fit(X, Y)
print("Number of Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
Output
Number of Features: 3
Selected Features: [ True False False False False True True False]
Feature Ranking: [1 2 3 5 6 1 1 4]
From the above output, we can see that RFE chooses preg, mass and pedi as the top 3 features; they are marked with a ranking of 1 in the output.
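If you would rather not fix the number of features in advance, scikit-learn also provides RFECV, which chooses it by cross-validation. A minimal sketch, reusing the X and Y arrays from the example above (the 5-fold split and accuracy scoring are just illustrative choices):
from sklearn.feature_selection import RFECV
# RFECV repeats the elimination inside a cross-validation loop and keeps the
# feature count that gives the best average score.
rfecv = RFECV(estimator=LogisticRegression(solver='liblinear'), cv=5, scoring='accuracy')
rfecv = rfecv.fit(X, Y)
print("Optimal number of features:", rfecv.n_features_)
print("Selected features:", rfecv.support_)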
Principal Component Analysis (PCA)
PCA, generally called a data reduction technique, is a very useful feature selection technique because it uses linear algebra to transform the dataset into a compressed form. We can implement PCA feature selection with the PCA class of the scikit-learn Python library, and we can choose the number of principal components to keep in the output.
Example:
In this example, we will use PCA to select the 3 best principal components from the Pima Indians Diabetes dataset.
from pandas import read_csv
from sklearn.decomposition import PCA
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values
Next, we will separate the array into its input and output components:
X = array[:,0:8]
Y = array[:,8]
The following lines of code will extract features from the dataset:
pca = PCA(n_components=3)
fit = pca.fit(X)
print("Explained Variance: %s" % fit.explained_variance_ratio_) # 修正了打印方式
print(fit.components_)
Output
Explained Variance: [0.89 0.06 0.03]
[[ -2.02e-03 9.78e-02 1.61e-02 6.08e-02 9.93e-01 1.40e-02 5.37e-04 -3.56e-03]
[ 2.26e-02 9.72e-01 1.42e-01 -5.79e-02 -9.46e-02 4.70e-02 8.17e-04 1.40e-01]
[-2.25e-02 1.43e-01 -9.22e-01 -3.07e-01 2.10e-02 -1.32e-01 -6.40e-04 -1.25e-01]]
From the above output, we can observe that the 3 principal components bear little resemblance to the original data: each component is a weighted combination of all 8 input attributes rather than a subset of them.
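Because PCA produces new features rather than picking a subset of the originals, you still have to project the data onto the components before using them for modeling. A minimal sketch, reusing the X array from above (standardizing first is a common precaution, not part of the original example, because PCA is sensitive to the scale of the attributes):
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Standardize the attributes so that no single large-valued column dominates the components.
X_scaled = StandardScaler().fit_transform(X)
pca_scaled = PCA(n_components=3)
X_reduced = pca_scaled.fit_transform(X_scaled)   # project the data onto the 3 components
print(X_reduced.shape)                           # (number of rows, 3)
print(pca_scaled.explained_variance_ratio_)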
Feature Importance
As the name suggests, the feature importance technique is used to choose important features. It essentially uses a trained supervised classifier to score the features. We can implement this feature selection technique with the ExtraTreesClassifier class of the scikit-learn Python library.
Example
In this example, we will use ExtraTreesClassifier to select features from the Pima Indians Diabetes dataset.
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values
Next, we will separate the array into its input and output components:
X = array[:,0:8]
Y = array[:,8]
The following lines of code will train the classifier and print the importance score of each feature:
model = ExtraTreesClassifier(n_estimators=100, random_state=7)   # fixed number of trees and random seed so the importance scores are reproducible
model.fit(X, Y)
print(model.feature_importances_)
Output
[0.11070069 0.2213717 0.08824115 0.08068703 0.07281761 0.14548537
0.12654214 0.15415431]
From the output, we can observe that there is a score for each attribute. The higher the score, the higher the importance of that attribute.
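To turn these scores into an actual feature subset, one option is scikit-learn's SelectFromModel, which keeps the attributes whose importance exceeds a threshold. A minimal sketch, reusing the fitted model, the names list and the X array from above (the mean-importance threshold is just an illustrative choice):
from sklearn.feature_selection import SelectFromModel
# Rank the attributes by importance, most important first.
importances = sorted(zip(names[0:8], model.feature_importances_), key=lambda p: p[1], reverse=True)
for name, score in importances:
    print("%s: %.3f" % (name, score))
# Keep only the attributes whose importance is above the mean importance.
selector = SelectFromModel(model, threshold='mean', prefit=True)
X_important = selector.transform(X)
print("Reduced shape:", X_important.shape)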