Section: Datasets in sklearn | Machine Learning Python Tutorial

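Scikit-learn provides a wide range of datasets for testing learning algorithms. They come in three types:

Packaged Data: these small datasets ship with the scikit-learn installation and can be loaded with the sklearn.datasets.load_* functions.

Downloadable Data: these larger datasets are available for download, and scikit-learn includes tools that streamline the process. These tools are found in sklearn.datasets.fetch_*.

Generated Data: several datasets are generated from models based on a random seed. These are available through sklearn.datasets.make_*.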
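
As a minimal sketch of the three families, the snippet below loads one packaged dataset and generates one synthetic dataset. The specific functions chosen (load_iris, make_blobs, fetch_california_housing) and the parameter values are illustrative picks, not ones named in the text above. The fetch_* call is left commented out because it downloads data on first use.

```python
from sklearn import datasets

# Packaged data: bundled with scikit-learn, no download needed.
iris = datasets.load_iris()
print(iris.data.shape)  # (150, 4)

# Generated data: drawn from a model; random_state fixes the random seed
# so the same points are produced on every run.
X, y = datasets.make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)
print(X.shape, y.shape)  # (100, 2) (100,)

# Downloadable data: fetch_* functions retrieve the data over the network
# on first use, so this call is only shown, not executed.
# housing = datasets.fetch_california_housing()
```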

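You can explore the available dataset loaders, fetchers, and generators with IPython's tab completion. After importing the datasets submodule from sklearn, type:

datasets.load_<TAB>

or

datasets.fetch_<TAB>

or

datasets.make_<TAB>

to see a list of the available functions.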
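
Outside IPython, roughly the same listing can be produced in plain Python by filtering the names exposed by the datasets submodule. This is only a sketch of one way to do it; the variable names are made up for illustration, and the exact functions listed depend on the installed scikit-learn version.

```python
from sklearn import datasets

# Collect the public names of the submodule and group them by prefix.
names = dir(datasets)
loaders = [name for name in names if name.startswith("load_")]
fetchers = [name for name in names if name.startswith("fetch_")]
generators = [name for name in names if name.startswith("make_")]

print(loaders)
print(fetchers)
print(generators)
```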

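Structure of Data and Labels

Data in scikit-learn is in most cases stored as two-dimensional NumPy arrays of shape (n, m). Many algorithms also accept scipy.sparse matrices of the same shape.

n (n_samples): the number of samples. Each sample is an item to process (e.g. classify). A sample can be a document, a picture, a sound, a video, an astronomical object, a row in a database or CSV file, or anything else you can describe with a fixed set of quantitative traits.

m (n_features): the number of features, i.e. distinct traits that can be used to describe each item quantitatively. Features are generally real-valued, but may be Boolean or discrete-valued in some cases.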
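
The (n, m) layout can be checked on a packaged dataset. The sketch below uses load_iris as an arbitrary example (it is not singled out by the text above) and also builds a scipy.sparse matrix of the same shape. The labels are stored separately as a one-dimensional array with one entry per sample.

```python
from scipy import sparse
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

# X is a 2-D NumPy array: one row per sample, one column per feature.
n_samples, n_features = X.shape
print(n_samples, n_features)  # 150 4

# y holds one label per sample.
print(y.shape)  # (150,)

# The same data as a scipy.sparse matrix keeps the (n_samples, n_features) shape.
X_sparse = sparse.csr_matrix(X)
print(X_sparse.shape)  # (150, 4)
```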

Python

from sklearn import datasets

Be warned: many of these datasets are quite large and can take a long time to download!
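
Downloaded datasets are cached on disk, so the long wait normally happens only on the first call. As a small sketch, datasets.get_data_home reports (and creates if needed) the cache folder. The directory name used here is a made-up example, and the remark that fetch_* functions take a data_home argument is stated from general knowledge of scikit-learn rather than from this tutorial.

```python
import os
import tempfile

from sklearn import datasets

# Hypothetical cache location for this demo; without the argument,
# scikit-learn falls back to its default folder in the user's home directory.
custom_dir = os.path.join(tempfile.gettempdir(), "sklearn_data_demo")

# Returns the path of the cache folder, creating it if it does not exist.
cache_dir = datasets.get_data_home(data_home=custom_dir)
print(cache_dir)
```

Passing the same path as data_home to a fetch_* function should make it store and reuse its download there.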

