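Introduction

Machine learning algorithms depend entirely on data, because data is what makes model training possible. At the same time, data that we cannot make sense of before feeding it to an ML algorithm is of little use. Put simply, we always need to supply the right data for the problem we want the machine to solve: data on an appropriate scale, in the correct format, and containing meaningful features. This makes data preparation one of the most important steps in the ML process. Data preparation can be defined as the procedure that makes a dataset more suitable for machine learning.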

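Why Data Preprocessing?

After the raw data for ML training has been selected, the most important task is data preprocessing. Broadly speaking, preprocessing converts the selected data into a form that we can work with and that can be fed to ML algorithms. The data must always be preprocessed so that it meets the expectations of the learning algorithm.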

Data Preprocessing Techniques

The following preprocessing techniques can be applied to a dataset to produce data for ML algorithms:

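Scaling

A dataset will most likely contain attributes on different scales. Many ML algorithms perform poorly on such data, so it needs to be rescaled. Rescaling puts all attributes on the same scale; usually they are rescaled into the range 0 to 1. Algorithms that are trained with gradient descent and distance-based algorithms such as k-Nearest Neighbors benefit in particular from scaled data. We can rescale the data with the MinMaxScaler class of the scikit-learn Python library.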

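Example

In this example, we rescale the Pima Indians Diabetes dataset that we used earlier. First the CSV data is loaded (as in the previous chapters), and then the MinMaxScaler class rescales it into the range 0 to 1.

The first few lines of the following script are the same as those we wrote earlier when loading CSV data.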

Python
from pandas import read_csv
from numpy import set_printoptions
from sklearn import preprocessing

path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

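Now we can use the MinMaxScaler class to rescale the data into the range 0 to 1.

Python
# Rescale every column to [0, 1]. Note that this covers all nine columns,
# including the 'class' label, which already takes only the values 0 and 1.
data_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_rescaled = data_scaler.fit_transform(array)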

We can also format the output as we like. Here we set the precision to 1 and show the first 10 rows.

Python
set_printoptions(precision=1)
print("\nScaled data:\n", data_rescaled[0:10])

Output

Scaled data:
[[0.4 0.7 0.6 0.4 0.  0.5 0.2 0.5 1. ]
 [0.1 0.4 0.5 0.3 0.  0.4 0.1 0.2 0. ]
 [0.5 0.9 0.5 0.  0.  0.3 0.3 0.2 1. ]
 [0.1 0.4 0.5 0.2 0.1 0.4 0.  0.  0. ]
 [0.  0.7 0.3 0.4 0.2 0.6 0.9 0.2 1. ]
 [0.3 0.6 0.6 0.  0.  0.4 0.1 0.2 0. ]
 [0.2 0.4 0.4 0.3 0.1 0.5 0.1 0.1 1. ]
 [0.6 0.6 0.  0.  0.  0.5 0.  0.1 0. ]
 [0.1 1.  0.6 0.5 0.6 0.5 0.  0.5 1. ]
 [0.5 0.6 0.8 0.  0.  0.  0.1 0.6 1. ]]

As the output shows, all values have been rescaled into the range 0 to 1.
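
To make the rescaling concrete, the following is a minimal pure-Python sketch of the min-max formula, (value - min) / (max - min), applied column by column. It is not scikit-learn's implementation; the function name min_max_scale, the toy data, and the handling of constant columns are assumptions made for illustration.

```python
def min_max_scale(rows, low=0.0, high=1.0):
    """Rescale each column of `rows` linearly into [low, high]."""
    columns = list(zip(*rows))            # transpose rows into columns
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        new_row = []
        for value, col_min, col_max in zip(row, mins, maxs):
            span = col_max - col_min
            # Assumption for this sketch: a constant column (span == 0)
            # is mapped to `low` to avoid dividing by zero.
            unit = (value - col_min) / span if span else 0.0
            new_row.append(low + unit * (high - low))
        scaled.append(new_row)
    return scaled


# Hypothetical toy data: two attributes on very different scales.
toy_rows = [[1.0, 200.0], [2.0, 400.0], [3.0, 1000.0]]
scaled_rows = min_max_scale(toy_rows)
print(scaled_rows)  # [[0.0, 0.0], [0.5, 0.25], [1.0, 1.0]]
```

Each column is scaled independently, so after rescaling both attributes span the same range regardless of their original units.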

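Normalization

Another useful preprocessing technique is normalization. It rescales each row of data so that the row has a length (norm) of 1. It is mainly useful for sparse datasets, which contain many zeros. We can normalize the data with the Normalizer class of the scikit-learn Python library.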

Types of Normalization

Two types of normalization are commonly used in machine learning preprocessing:

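L1 Normalization

L1 normalization rescales the values so that, in each row, the sum of the absolute values is always 1. It is sometimes referred to as least absolute deviations, a name that strictly belongs to the regression criterion based on the same L1 norm.

Example

In this example, we use L1 normalization on the Pima Indians Diabetes dataset that we used earlier. First the CSV data is loaded, and then it is normalized with the Normalizer class.

The first few lines of the following script are the same as those we wrote earlier when loading CSV data.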

Python
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import Normalizer

path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

Now we can use the Normalizer class with the L1 norm to normalize the data.

Python
Data_normalizer = Normalizer(norm='l1').fit(array)
Data_normalized = Data_normalizer.transform(array)

We can also format the output as we like. Here we set the precision to 2 and show the first 3 rows.

Python
set_printoptions(precision=2)
print("\nNormalized data:\n", Data_normalized[0:3])

Output

Normalized data:
[[0.02 0.43 0.21 0.1  0.   0.1  0.   0.14 0.  ]
 [0.   0.36 0.28 0.12 0.   0.11 0.   0.13 0.  ]
 [0.03 0.59 0.21 0.   0.   0.07 0.   0.1  0.  ]]
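
As a rough illustration of what L1 normalization does to one row, here is a pure-Python sketch. The function name l1_normalize and the toy row are hypothetical, and leaving an all-zero row unchanged is an assumption of this sketch rather than a statement about the library.

```python
def l1_normalize(row):
    """Divide each value by the sum of absolute values in the row."""
    total = sum(abs(value) for value in row)
    # Assumption for this sketch: an all-zero row is returned unchanged.
    if total == 0:
        return list(row)
    return [value / total for value in row]


# Hypothetical toy row; signs are preserved, only magnitudes are rescaled.
toy_row = [1.0, -3.0, 4.0]
l1_row = l1_normalize(toy_row)
print(l1_row)                          # [0.125, -0.375, 0.5]
print(sum(abs(v) for v in l1_row))     # 1.0
```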
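
L2 Normalization

L2 normalization rescales the values so that, in each row, the sum of the squares is always 1. It is sometimes referred to as least squares, a name that strictly belongs to the fitting criterion based on the same L2 norm.

Example

In this example, we use L2 normalization on the Pima Indians Diabetes dataset that we used earlier. First the CSV data is loaded (as in the previous chapters), and then it is normalized with the Normalizer class.

The first few lines of the following script are the same as those we wrote earlier when loading CSV data.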

Python
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import Normalizer

path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

Now we can use the Normalizer class with the L2 norm to normalize the data.

Python
Data_normalizer = Normalizer(norm='l2').fit(array)
Data_normalized = Data_normalizer.transform(array)

We can also format the output as we like. Here we set the precision to 2 and show the first 3 rows.

Python
set_printoptions(precision=2)
print("\nNormalized data:\n", Data_normalized[0:3])

Output

Normalized data:
[[0.03 0.83 0.4  0.2  0.   0.19 0.   0.28 0.01]
 [0.01 0.72 0.56 0.24 0.   0.22 0.   0.26 0.  ]
 [0.04 0.92 0.32 0.   0.   0.12 0.   0.16 0.01]]
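
The same idea for the L2 norm can be sketched in pure Python: each value is divided by the Euclidean length of its row. The function name l2_normalize and the toy row are hypothetical, and the treatment of all-zero rows is again an assumption of this sketch.

```python
import math


def l2_normalize(row):
    """Divide each value by the Euclidean length of the row."""
    length = math.sqrt(sum(value * value for value in row))
    # Assumption for this sketch: an all-zero row is returned unchanged.
    if length == 0:
        return list(row)
    return [value / length for value in row]


# Hypothetical sparse toy row with Euclidean length 5.
toy_row = [3.0, 0.0, 4.0]
l2_row = l2_normalize(toy_row)
print(l2_row)  # [0.6, 0.0, 0.8]
```

Zeros stay zero under both norms, which is one reason row normalization suits sparse data.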

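Binarization

As the name suggests, this technique makes the data binary by applying a threshold. Values above the threshold are converted to 1, and values equal to or below it are converted to 0. For example, with a threshold of 0.5, every value above 0.5 becomes 1 and every other value becomes 0. This is why the technique is also called thresholding. It is useful when the dataset contains probabilities that we want to convert into crisp values.

We can binarize the data with the Binarizer class of the scikit-learn Python library.

Example

In this example, we binarize the Pima Indians Diabetes dataset that we used earlier. First the CSV data is loaded, and then the Binarizer class converts it into binary values (0 and 1) depending on the threshold. We use 0.5 as the threshold.

The first few lines of the following script are the same as those we wrote earlier when loading CSV data.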

Python
from pandas import read_csv
from sklearn.preprocessing import Binarizer

path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

Now we can use the Binarizer class to convert the data into binary values.

Python
binarizer = Binarizer(threshold=0.5).fit(array)
Data_binarized = binarizer.transform(array)

Here we show the first 5 rows of the output.

Python
print("\nBinary data:\n", Data_binarized[0:5])

Output

Binary data:
[[1. 1. 1. 1. 0. 1. 1. 1. 1.]
 [1. 1. 1. 1. 0. 1. 0. 1. 0.]
 [1. 1. 1. 0. 0. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 0. 1. 0.]
 [0. 1. 1. 1. 1. 1. 1. 1. 1.]]
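
The thresholding rule can be sketched in a few lines of pure Python. The function name binarize and the toy probabilities are hypothetical; the point of the sketch is the boundary case, where a value exactly equal to the threshold maps to 0.

```python
def binarize(row, threshold=0.5):
    """Map values strictly above the threshold to 1.0, all others to 0.0."""
    return [1.0 if value > threshold else 0.0 for value in row]


# Hypothetical probabilities; 0.5 sits exactly on the threshold.
toy_probs = [0.2, 0.5, 0.51, 0.9]
binary_row = binarize(toy_probs)
print(binary_row)  # [0.0, 0.0, 1.0, 1.0]
```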

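Standardization

Standardization is another useful preprocessing technique. It is typically applied to attributes that are roughly Gaussian: it shifts and scales each attribute so that it has a mean of 0 and a standard deviation (SD) of 1. It does not change the shape of the distribution. The technique is useful for algorithms such as linear regression and logistic regression, which are often described as assuming Gaussian-distributed inputs and which tend to produce better results with rescaled data. We can standardize the data (mean = 0, SD = 1) with the StandardScaler class of the scikit-learn Python library.

Example

In this example, we standardize the Pima Indians Diabetes dataset that we used earlier. First the CSV data is loaded, and then the StandardScaler class rescales it so that each column has mean = 0 and SD = 1.

The first few lines of the following script are the same as those we wrote earlier when loading CSV data.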

Python
from sklearn.preprocessing import StandardScaler
from pandas import read_csv
from numpy import set_printoptions

path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

Now we can use the StandardScaler class to standardize the data.

Python
data_scaler = StandardScaler().fit(array)
data_rescaled = data_scaler.transform(array)

We can also format the output as we like. Here we set the precision to 2 and show the first 5 rows.

Python
set_printoptions(precision=2)
print("\nRescaled data:\n", data_rescaled[0:5])

Output

Rescaled data:
[[ 0.64  0.85  0.15  0.91 -0.69  0.2   0.47  1.43  1.37]
 [-0.84 -1.12 -0.16  0.53 -0.69 -0.68 -0.37 -0.19 -0.73]
 [ 1.23  1.94 -0.26 -1.29 -0.69 -1.1   0.6  -0.11  1.37]
 [-0.84 -1.  -0.16  0.15  0.12 -0.49 -0.92 -1.04 -0.73]
 [-1.14  0.5  -1.5   0.91  0.77  1.41  5.48 -0.02  1.37]]
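
The arithmetic behind standardization is the z-score, (value - mean) / SD, computed per column. Below is a minimal pure-Python sketch for a single column. The function name standardize and the toy column are hypothetical. The sketch uses the population standard deviation (dividing by n), which matches how StandardScaler computes it; mapping a constant column to zeros is an assumption of this sketch.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = fmean(column)
    sd = pstdev(column)  # population SD (divides by n, not n - 1)
    # Assumption for this sketch: a constant column maps to all zeros.
    if sd == 0:
        return [0.0 for _ in column]
    return [(value - mean) / sd for value in column]


# Hypothetical toy column with mean 5 and population SD 2.
toy_column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z_scores = standardize(toy_column)
print(z_scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

The z-scores keep the ordering and relative spacing of the original values, so a skewed column stays skewed after standardization.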

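Data Labeling

We have discussed the importance of good data for ML algorithms and some techniques for preprocessing data before sending it to them. A further aspect is data labeling: the data sent to ML algorithms must also carry proper labels. In classification problems, for example, the data carries many labels in the form of words, numbers, and so on.

What is Label Encoding?

Many scikit-learn functions expect numeric labels rather than word labels, so word labels need to be converted into numbers. This process is called label encoding. We can perform it with the LabelEncoder class of the scikit-learn Python library.

Example

The following Python script performs label encoding.

First, import the required Python libraries: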

Python
import numpy as np
from sklearn import preprocessing

Next, provide the input labels:

Python
input_labels = ['red','black','red','green','black','yellow','white']

The next two lines create the label encoder and fit it to the input labels.

Python
encoder = preprocessing.LabelEncoder()
encoder.fit(input_labels)

The next lines check the encoder by encoding a list of labels in arbitrary order:

Python
test_labels = ['green','red','black']
encoded_values = encoder.transform(test_labels)
print("\nLabels =", test_labels)
print("Encoded values =", list(encoded_values))

The encoder can also map encoded values back to labels. The encoded values below are assumed for the purpose of demonstrating the inverse transform:

Python
encoded_values = [3, 0, 4, 1]
decoded_list = encoder.inverse_transform(encoded_values)
print("\nEncoded values =", encoded_values)
print("\nDecoded labels =", list(decoded_list))

Output

Labels = ['green', 'red', 'black']
Encoded values = [1, 2, 0]

Encoded values = [3, 0, 4, 1]
Decoded labels = ['white', 'black', 'yellow', 'green']
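
The numbers in this output follow from the alphabetical order of the distinct labels: black = 0, green = 1, red = 2, white = 3, yellow = 4. The pure-Python sketch below reproduces that mapping by hand. The helper name fit_label_mapping is hypothetical, and the sketch only mimics the behavior shown above rather than the library's internals.

```python
def fit_label_mapping(labels):
    """Return the sorted distinct labels and a label -> integer code mapping."""
    classes = sorted(set(labels))
    to_code = {label: code for code, label in enumerate(classes)}
    return classes, to_code


input_labels = ['red', 'black', 'red', 'green', 'black', 'yellow', 'white']
classes, to_code = fit_label_mapping(input_labels)

# Encoding looks each label up in the mapping.
encoded = [to_code[label] for label in ['green', 'red', 'black']]
# Decoding indexes back into the sorted list of classes.
decoded = [classes[code] for code in [3, 0, 4, 1]]

print(classes)  # ['black', 'green', 'red', 'white', 'yellow']
print(encoded)  # [1, 2, 0]
print(decoded)  # ['white', 'black', 'yellow', 'green']
```

Because the codes come from sorted order, they carry no meaning beyond identity; an algorithm that treats them as magnitudes would read 'yellow' as larger than 'black'.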


7. Machine Learning with Python – Preparing Data

Last modified: Thursday, 26 June 2025, 9:48 AM