Machine Learning with Python: 4. Machine Learning with Python - Data Loading for ML Projects

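Suppose you want to start a machine learning (ML) project. What is the first and most important thing you need? The data: it has to be loaded before any ML project can begin. The most common data format for ML projects is CSV (comma-separated values).

CSV is a simple file format that stores tabular data (numbers and text), such as a spreadsheet, as plain text. Python offers several ways to load CSV data, but there are a few considerations to keep in mind before doing so.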

Considerations While Loading CSV Data

CSV is the most common format for ML data, but the following considerations matter when loading it into an ML project:

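File Header

In a CSV data file, the header row names each field. The header must use the same delimiter as the data rows, because the header determines how the data fields are interpreted.

There are two cases to consider:

Case I: The data file has a header. The column names can be taken from the header row automatically.
Case II: The data file has no header. We need to assign a name to each column manually.

In both cases, we should tell the loader explicitly whether the CSV file contains a header.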
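The two header cases can be sketched with pandas.read_csv(), which is introduced later in this chapter. This is a minimal illustration, not a prescribed recipe: the in-memory CSV text and the column names are hypothetical stand-ins for real files.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV text standing in for files on disk.
with_header = "sepal_length,sepal_width\n5.1,3.5\n4.9,3.0\n"
without_header = "5.1,3.5\n4.9,3.0\n"

# Case I: the first row holds the column names.
df1 = pd.read_csv(io.StringIO(with_header), header=0)

# Case II: there is no header row, so say so and supply the names manually.
df2 = pd.read_csv(io.StringIO(without_header), header=None,
                  names=["sepal_length", "sepal_width"])

print(list(df1.columns))
print(df1.shape, df2.shape)
```

Both calls should yield the same two-row table; only the source of the column names differs.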

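Comments

Comments in a data file carry useful information. In CSV data files, a comment is conventionally marked by a hash (#) at the start of a line. If the file contains comments, then depending on the loading method we choose, we may need to indicate whether to expect them.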
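As a rough illustration of how the loading method affects comment handling, the sketch below reads the same hypothetical in-memory text with NumPy and with Pandas. numpy.loadtxt() skips lines starting with # by default, whereas pandas.read_csv() needs the comment argument.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical CSV text whose first line is a comment.
raw = "# measurements exported for illustration\n6,148,72\n1,85,66\n"

# numpy.loadtxt treats '#' as the comment marker by default (comments='#').
arr = np.loadtxt(io.StringIO(raw), delimiter=",")

# pandas.read_csv only skips comment lines when comment='#' is passed.
df = pd.read_csv(io.StringIO(raw), header=None, comment="#")

print(arr.shape)
print(df.shape)
```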

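Delimiter

In CSV data files, the comma (,) is the standard delimiter. The delimiter separates the values of the fields in each row. Other delimiters, such as a tab or white space, can also be used, but any delimiter other than the standard one must be specified explicitly when the file is loaded.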
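A non-standard delimiter can be passed to the loader explicitly. The sketch below uses the standard library csv.reader() on hypothetical tab-separated text held in memory.

```python
import csv
import io

# Hypothetical tab-separated text standing in for a file on disk.
tab_text = "preg\tplas\tpres\n6\t148\t72\n"

# delimiter='\t' tells the reader that fields are separated by tabs.
rows = list(csv.reader(io.StringIO(tab_text), delimiter="\t"))

print(rows)
```

Without the delimiter argument, each line would come back as a single field, because no commas are present.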

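Quotes

In CSV data files, the double quotation mark (") is the default quote character. Other quote characters can be used, but any quote character other than the standard one must be specified explicitly when the file is loaded.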
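The role of the quote character shows up when a field itself contains the delimiter. The following sketch, using hypothetical in-memory text, compares the default double quote with a single quote passed through the quotechar argument of csv.reader().

```python
import csv
import io

# Hypothetical rows in which the second field contains a comma.
double_quoted = 'name,notes\n"Iris setosa","short, wide petals"\n'
single_quoted = "name,notes\n'Iris setosa','short, wide petals'\n"

# The default quote character is the double quote.
rows_default = list(csv.reader(io.StringIO(double_quoted)))

# A different quote character has to be named explicitly.
rows_single = list(csv.reader(io.StringIO(single_quoted), quotechar="'"))

# Without quotechar, the embedded comma splits the field in two.
rows_wrong = list(csv.reader(io.StringIO(single_quoted)))

print(rows_default[1])
print(rows_single[1])
print(len(rows_wrong[1]))
```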

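Methods to Load CSV Data File

When working on an ML project, one of the most crucial tasks is loading the data correctly. CSV, the most common data format for ML projects, comes in various flavors with varying difficulty to parse. This section discusses three common approaches to loading a CSV data file in Python:

Load CSV with Python Standard Library

The first approach, and a widely used one, relies on the Python standard library, whose built-in csv module provides the reader() function. The following is an example of loading a CSV data file with it:

Example

This example uses the iris flower data set, which can be downloaded into a local directory. Once the file is loaded, the data can be converted to a NumPy array and used in an ML project. The script is built up step by step below.

First, import the csv module provided by the Python standard library: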

Python

import csv

Next, import the NumPy module, which is used to convert the loaded data into a NumPy array:

Python

import numpy as np

Now, provide the full path of the CSV data file stored in the local directory:

Python

path = r"c:\iris.csv"

Next, use the csv.reader() function to read the data from the CSV file:

Python

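# newline='' lets the csv module handle line endings itself
with open(path, 'r', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    headers = next(reader)   # first row holds the column names
    data = list(reader)      # remaining rows, as lists of strings

# convert the string fields to floats (assumes every column is numeric)
data = np.array(data).astype(float)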

We can print the header names with the following line:

Python

print(headers)

The following line prints the shape of the data, i.e. the number of rows and columns in the file:

Python

print(data.shape)

The next line prints the first three rows of the data:

Python

print(data[:3])

Output

['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
(150, 4)
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]]
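The conversion to float only works when every column is numeric. Some copies of the iris data also carry a text label column; that is an assumption here, since the output above shows four numeric columns. In that situation, a minimal sketch on hypothetical in-memory rows could separate the label before converting:

```python
import csv
import io
import numpy as np

# Hypothetical sample with a trailing text column named 'species'.
sample = ("sepal_length,sepal_width,petal_length,petal_width,species\n"
          "5.1,3.5,1.4,0.2,setosa\n"
          "4.9,3.0,1.4,0.2,setosa\n")

reader = csv.reader(io.StringIO(sample))
headers = next(reader)   # column names
rows = list(reader)      # data rows as lists of strings

# Convert only the numeric columns; keep the label column as text.
features = np.array([row[:-1] for row in rows]).astype(float)
labels = [row[-1] for row in rows]

print(features.shape)
print(labels)
```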

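Load CSV with NumPy

Another approach uses NumPy and its numpy.loadtxt() function. The following is an example of loading a CSV data file with it:

Example

This example uses the Pima Indians dataset, which contains data on diabetic patients. It is a numeric dataset with no header, and it can also be downloaded into a local directory. numpy.loadtxt() returns the data as a NumPy array that can be used directly in an ML project. The following is the Python script for loading the CSV data file: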

Python

from numpy import loadtxt

path = r"C:\pima-indians-diabetes.csv"
# loadtxt accepts a file path directly and closes the file itself
data = loadtxt(path, delimiter=",")

print(data.shape)   # (rows, columns)
print(data[:3])     # first three rows

Output

(768, 9)
[[  6.  148.   72.   35.    0.   33.6   0.627  50.    1.]
 [  1.   85.   66.   29.    0.   26.6   0.351  31.    0.]
 [  8.  183.   64.    0.    0.   23.3   0.672  32.    1.]]
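Once the data is a NumPy array, it is common to separate the input columns from the target column. The sketch below does this on a small hypothetical sample held in memory; it also uses skiprows=1 to show how numpy.loadtxt() can step over a header row if a file has one. Treating the last column as the target is an assumption for illustration.

```python
import io
import numpy as np

# Hypothetical three-column sample with a header row.
sample = "preg,plas,class\n6,148,1\n1,85,0\n8,183,1\n"

# skiprows=1 skips the header line, which loadtxt cannot parse as numbers.
data = np.loadtxt(io.StringIO(sample), delimiter=",", skiprows=1)

X = data[:, :-1]   # all columns except the last: input features
y = data[:, -1]    # last column: target (assumed)

print(X.shape, y.shape)
```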

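Load CSV with Pandas

Another approach uses Pandas and its pandas.read_csv() function. This very flexible function returns a pandas.DataFrame, which can be used immediately for plotting. The following is an example of loading a CSV data file with it:

Example

Here we implement two Python scripts: the first uses the Iris data set, which has a header, and the second uses the Pima Indians dataset, a numeric dataset with no header. Both datasets can be downloaded into a local directory.

Script-1

The following Python script loads the Iris data set with Pandas: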

Python

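from pandas import read_csv

path = r"C:\iris.csv"
# the first row of the file is used as the header by default
data = read_csv(path)

print(data.shape)     # (rows, columns)
print(data.head(3))   # first three rows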

Output:

(150, 4)
   sepal_length  sepal_width  petal_length  petal_width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2

Script-2

The following Python script loads the Pima Indians Diabetes dataset with Pandas and supplies the header names as well:

Python

from pandas import read_csv

path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# passing names= makes pandas read every line of the file as data
data = read_csv(path, names=headernames)

print(data.shape)     # (rows, columns)
print(data.head(3))   # first three rows

Output

(768, 9)
   preg  plas  pres  skin  test  mass   pedi  age  class
0     6   148    72    35     0  33.6  0.627   50      1
1     1    85    66    29     0  26.6  0.351   31      0
2     8   183    64     0     0  23.3  0.672   32      1

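The examples above show how the three approaches differ: csv.reader() yields rows of strings that must be converted by hand, numpy.loadtxt() returns a numeric NumPy array directly, and pandas.read_csv() returns a DataFrame with named columns.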
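To see the three approaches side by side, the following sketch loads the same hypothetical in-memory text with each of them and inspects what comes back. The sample values and column names are illustrative only.

```python
import csv
import io
import numpy as np
import pandas as pd

# Hypothetical headerless sample shared by all three loaders.
text = "6,148,72\n1,85,66\n"

rows = list(csv.reader(io.StringIO(text)))           # lists of strings
arr = np.loadtxt(io.StringIO(text), delimiter=",")   # float NumPy array
df = pd.read_csv(io.StringIO(text), header=None,
                 names=["preg", "plas", "pres"])     # DataFrame with names

print(type(rows[0][0]).__name__)
print(arr.dtype)
print(list(df.columns))
```

All three hold the same values; they differ in the container and in how much type conversion is left to the caller.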

Last modified: Thursday, 26 June 2025, 9:34 AM