Introduction

While working on machine learning projects, we often overlook the two most important ingredients: mathematics and data. ML is a data-driven approach, and our ML model will produce results only as good (or as bad) as the data we feed it.

In the previous chapter, we discussed how to load CSV data into our ML project, but it is better to understand the data before using it. We can understand data in two ways: with statistics and with visualization.

In this chapter, with the help of the following Python recipes, we are going to understand ML data with statistics.

Looking at Raw Data

The very first recipe is to look at the raw data. This matters because the insights we gain from looking at raw data improve our chances of pre-processing and handling the data well for an ML project.

The following Python script uses the head() function of a Pandas DataFrame on the Pima Indians diabetes dataset to look at the first 50 rows and get a better understanding of it:

Example

Python
from pandas import read_csv

path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)  # the CSV has no header row, so supply column names
print(data.head(50))                      # show the first 50 rows

Output

    preg  plas  pres  skin  test  mass   pedi  age  class
0      6   148    72    35     0  33.6  0.627   50      1
1      1    85    66    29     0  26.6  0.351   31      0
2      8   183    64     0     0  23.3  0.672   32      1
3      1    89    66    23    94  28.1  0.167   21      0
4      0   137    40    35   168  43.1  2.288   33      1
5      5   116    74     0     0  25.6  0.201   30      0
6      3    78    50    32    88  31.0  0.248   26      1
7     10   115     0     0     0  35.3  0.134   29      0
8      2   197    70    45   543  30.5  0.158   53      1
9      8   125    96     0     0   0.0  0.232   54      1
10     4   110    92     0     0  37.6  0.191   30      0
11    10   168    74     0     0  38.0  0.537   34      1
12    10   139    80     0     0  27.1  1.441   57      0
13     1   189    60    23   846  30.1  0.398   59      1
14     5   166    72    19   175  25.8  0.587   51      1
15     7   100     0     0     0  30.0  0.484   32      1
16     0   118    84    47   230  45.8  0.551   31      1
17     7   107    74     0     0  29.6  0.254   31      1
18     1   103    30    38    83  43.3  0.183   33      0
19     1   115    70    30    96  34.6  0.529   32      1
20     3   126    88    41   235  39.3  0.704   27      0
21     8    99    84     0     0  35.4  0.388   50      0
22     7   196    90     0     0  39.8  0.451   41      1
23     9   119    80    35     0  29.0  0.263   29      1
24    11   143    94    33   146  36.6  0.254   51      1
25    10   125    70    26   115  31.1  0.205   41      1
26     7   147    76     0     0  39.4  0.257   43      1
27     1    97    66    15   140  23.2  0.487   22      0
28    13   145    82    19   110  22.2  0.245   57      0
29     5   117    92     0     0  34.1  0.337   38      0
30     5   109    75    26     0  36.0  0.546   60      0
31     3   158    76    36   245  31.6  0.851   28      1
32     3    88    58    11    54  24.8  0.267   22      0
33     6    92    92     0     0  19.9  0.188   28      0
34    10   122    78    31     0  27.6  0.512   45      0
35     4   103    60    33   192  24.0  0.966   33      0
36    11   138    76     0     0  33.2  0.420   35      0
37     9   102    76    37     0  32.9  0.665   46      1
38     2    90    68    42     0  38.2  0.503   27      1
39     4   111    72    47   207  37.1  1.390   56      1
40     3   180    64    25    70  34.0  0.271   26      0
41     7   133    84     0     0  40.2  0.696   37      0
42     7   106    92    18     0  22.7  0.235   48      0
43     9   171   110    24   240  45.4  0.721   54      1
44     7   159    64     0     0  27.4  0.294   40      0
45     0   180    66    39     0  42.0  1.893   25      1
46     1   146    56     0     0  29.7  0.564   29      0
47     2    71    70    27     0  28.0  0.586   22      0
48     7   103    66    32     0  39.1  0.344   31      1
49     7   105     0     0     0   0.0  0.305   24      0

From the above output, we can observe that the first column gives the row index, which is very useful for referencing a specific observation.
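
Besides head(), it can also help to look at the end of the file or at a random handful of rows. A minimal sketch, reusing the data variable from above (tail() and sample() are standard Pandas DataFrame methods):

Python
print(data.tail(10))                     # last 10 rows -- handy for spotting truncated files
print(data.sample(5, random_state=42))   # 5 random rows for an unbiased glance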

Checking the Dimensions of Data

It is always good practice to know how much data we have, in terms of rows and columns, for our ML project. The reasons are:

  • If we have too many rows and columns, running the algorithms and training the model would take a long time.
  • If we have too few rows and columns, we would not have enough data to train the model well.

The following Python script prints the shape property of a Pandas DataFrame. We will run it on the iris dataset to get its total numbers of rows and columns:

Example

Python
from pandas import read_csv

path = r"C:\iris.csv"
data = read_csv(path)
print(data.shape)  # prints a (rows, columns) tuple

Output

(150, 4)

From the output, we can easily observe that the iris dataset we are going to use has 150 rows and 4 columns.
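
Since shape is simply a (rows, columns) tuple, the two counts can also be unpacked and used separately; a small sketch with the data variable from above:

Python
n_rows, n_cols = data.shape   # shape is a plain (rows, columns) tuple
print(n_rows, "rows and", n_cols, "columns")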

Getting Each Attribute's Data Type

Knowing the data type of each attribute is another good practice, because, depending on the requirements, we sometimes need to convert one data type to another. For example, we may need to convert strings into floating-point or integer values to represent categorical or ordinal values. We can get an idea of an attribute's data type by inspecting the raw data, but another way is to use the dtypes property of a Pandas DataFrame, which reports the data type of every attribute. This can be understood with the help of the following Python script:

Example

Python
from pandas import read_csv

path = r"C:\iris.csv"
data = read_csv(path)
print(data.dtypes)

Output

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
dtype: object

From the above output, we can easily get the data type of each attribute.
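
If a conversion like the one mentioned above is needed, the astype() method is one common way to do it. A minimal sketch, downcasting one of the iris attributes shown in the output (the float32 target is chosen purely for illustration):

Python
data['sepal_length'] = data['sepal_length'].astype('float32')  # convert the column's dtype
print(data.dtypes)  # sepal_length now reports float32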

Statistical Summary of Data

We have discussed the Python recipe for getting the shape of the data, i.e. its numbers of rows and columns, but we often need to review summary statistics beyond that shape. This can be done with the describe() function of a Pandas DataFrame, which provides the following 8 statistical properties of each data attribute:

  • Count
  • Mean
  • Standard deviation
  • Minimum value
  • 25th percentile
  • Median (the 50th percentile)
  • 75th percentile
  • Maximum value

Example

Python
from pandas import read_csv
from pandas import set_option

path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
set_option('display.width', 100)      # widen console output so the table is not wrapped
set_option('display.precision', 2)    # full option name; bare 'precision' was removed in newer pandas
print(data.shape)
print(data.describe())

Output

(768, 9)
         preg    plas    pres    skin    test    mass    pedi     age   class
count  768.00  768.00  768.00  768.00  768.00  768.00  768.00  768.00  768.00
mean     3.85  120.89   69.11   20.54   79.80   31.99    0.47   33.24    0.35
std      3.37   31.97   19.36   15.95  115.24    7.88    0.33   11.76    0.48
min      0.00    0.00    0.00    0.00    0.00    0.00    0.08   21.00    0.00
25%      1.00   99.00   62.00    0.00    0.00   27.30    0.24   24.00    0.00
50%      3.00  117.00   72.00   23.00   30.50   32.00    0.37   29.00    0.00
75%      6.00  140.25   80.00   32.00  127.25   36.60    0.63   41.00    1.00
max     17.00  199.00  122.00   99.00  846.00   67.10    2.42   81.00    1.00

From the above output, we can observe the statistical summary of the Pima Indians diabetes dataset along with its shape.
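
One insight such a summary can reveal: several attributes (plas, pres, skin, test, mass) have a minimum of 0, which is physiologically implausible and most likely encodes missing values. A small sketch, assuming the data variable above, to count those zeros:

Python
zero_cols = ['plas', 'pres', 'skin', 'test', 'mass']  # columns where a 0 is implausible
print((data[zero_cols] == 0).sum())                   # number of zero entries per column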

Reviewing Class Distribution

Class distribution statistics are useful in classification problems, where we need to know how balanced the class values are. This is important because a highly imbalanced class distribution, i.e. one class having far more observations than another, may need special handling at the data preparation stage of an ML project. We can easily get the class distribution in Python with the help of a Pandas DataFrame.

Example

Python
from pandas import read_csv

path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
count_class = data.groupby('class').size()  # number of observations per class value
print(count_class)

Output

class
0    500
1    268
dtype: int64

From the above output, it can be clearly seen that the number of observations with class 0 is almost double the number with class 1.
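
To express the imbalance as proportions rather than raw counts, value_counts() with normalize=True is one option; a minimal sketch:

Python
print(data['class'].value_counts(normalize=True))  # fraction of rows per class
# expected: roughly 0.65 for class 0 and 0.35 for class 1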

Reviewing Correlation between Attributes

The relationship between two variables is called correlation. In statistics, the most common way to measure it is Pearson's correlation coefficient, whose value always lies between -1 and 1 (its definition is given right after the list below). Three landmark values are worth remembering:

  • Coefficient value 1: full positive correlation between the variables.
  • Coefficient value -1: full negative correlation between the variables.
  • Coefficient value 0: no linear correlation at all between the variables.
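
For reference, for two attributes X and Y with sample means $\bar{x}$ and $\bar{y}$, the coefficient is

$$ r_{XY} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} $$

i.e. the covariance of the two attributes divided by the product of their standard deviations.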

It is always good to review the pairwise correlations of the attributes in a dataset before using it in an ML project, because some machine learning algorithms, such as linear regression and logistic regression, can perform poorly in the presence of highly correlated attributes. In Python, we can easily compute a correlation matrix of the dataset's attributes with the corr() function of a Pandas DataFrame.

Example

Python
from pandas import read_csv
from pandas import set_option

path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
set_option('display.width', 100)      # widen console output
set_option('display.precision', 2)    # full option name; bare 'precision' was removed in newer pandas
correlations = data.corr(method='pearson')  # pairwise Pearson correlation matrix
print(correlations)

Output

      preg  plas  pres  skin  test  mass  pedi   age  class
preg  1.00  0.13  0.14 -0.08 -0.07  0.02 -0.03  0.54   0.22
plas  0.13  1.00  0.15  0.06  0.33  0.22  0.14  0.26   0.47
pres  0.14  0.15  1.00  0.21  0.09  0.28  0.04  0.24   0.07
skin -0.08  0.06  0.21  1.00  0.44  0.39  0.18 -0.11   0.07
test -0.07  0.33  0.09  0.44  1.00  0.20  0.19 -0.04   0.13
mass  0.02  0.22  0.28  0.39  0.20  1.00  0.14  0.04   0.29
pedi -0.03  0.14  0.04  0.18  0.19  0.14  1.00  0.03   0.17
age   0.54  0.26  0.24 -0.11 -0.04  0.04  0.03  1.00   0.24
class 0.22  0.47  0.07  0.07  0.13  0.29  0.17  0.24   1.00

The matrix in the above output gives the correlation between every pair of attributes in the dataset.
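
To turn the matrix into a short list of potentially problematic pairs, one approach is to scan its upper triangle for absolute values above a threshold; a sketch (the 0.5 cutoff is arbitrary, for illustration only):

Python
threshold = 0.5                            # illustrative cutoff, not a standard value
cols = correlations.columns
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):      # upper triangle: each pair once, no self-pairs
        r = correlations.iloc[i, j]
        if abs(r) > threshold:
            print(cols[i], cols[j], round(r, 2))
# on this dataset, only (preg, age) at 0.54 crosses the cutoff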

Reviewing Skew of Attribute Distributions

Skewness can be defined as the degree to which a distribution assumed to be Gaussian appears distorted or shifted in one direction or the other, i.e. to the left or to the right. Reviewing the skewness of attributes is an important task for the following reasons:

  • Skewness in the data may require correction at the data preparation stage so that we can get better accuracy from our model.
  • Many ML algorithms assume that the data follows a Gaussian distribution, i.e. normal, bell-curve-shaped data.

In Python, we can easily calculate the skew of each attribute by using the skew() function on a Pandas DataFrame.

Example

Python
from pandas import read_csv

path = r"C:\pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=names)
print(data.skew())  # per-attribute skewness; 0 means perfectly symmetric

Output

preg     0.90
plas     0.17
pres    -1.84
skin     0.11
test     2.27
mass    -0.43
pedi     1.92
age      1.13
class    0.64
dtype: float64

From the above output, positive or negative skew can be observed for each attribute. The closer the value is to zero, the less skewed the attribute.
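
As a hedged illustration of the correction mentioned above, a log transform is one common way to pull in strong positive skew. The sketch below applies numpy's log1p (log(1 + x), which is safe for the zeros in the column) to test, the most skewed attribute in the output:

Python
import numpy as np

test_log = np.log1p(data['test'])            # log(1 + x) tolerates the zero entries
print(data['test'].skew(), test_log.skew())  # the second value should sit much closer to 0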

