Datasets in sklearn
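Scikit-learn provides a large number of datasets for testing learning algorithms. They come in three types:
- Packaged data: these small datasets ship with the scikit-learn installation and can be loaded with the sklearn.datasets.load_* functions.
- Downloadable data: these larger datasets are available for download, and scikit-learn includes tools that streamline the process. These tools are the sklearn.datasets.fetch_* functions.
- Generated data: several datasets are generated from models based on a random seed. These are available through the sklearn.datasets.make_* functions.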
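As a minimal sketch of the three families, the snippet below calls one loader and one generator. The functions picked here (load_iris, make_blobs) and the parameter values are illustrative choices, not something the text above prescribes. A fetch_* call appears only as a comment because it needs network access.

```python
from sklearn import datasets

# Packaged data: bundled with the installation, no download needed.
iris = datasets.load_iris()
print(iris.data.shape)  # (150, 4)

# Generated data: drawn from a model; a fixed random seed makes it reproducible.
X_a, y_a = datasets.make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)
X_b, y_b = datasets.make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)
print((X_a == X_b).all())  # True: the same seed yields the same data

# Downloadable data: fetched on first use and cached locally (requires network).
# faces = datasets.fetch_olivetti_faces()
```

Fixing random_state is what ties a generated dataset to "a random seed": two calls with the same seed return identical arrays.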
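You can explore the available dataset loaders, fetchers, and generators using IPython's tab-completion functionality. After importing the datasets submodule from sklearn, type
datasets.load_<TAB>
or
datasets.fetch_<TAB>
or
datasets.make_<TAB>
to see a list of available functions.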
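Outside IPython, roughly the same listing can be obtained by filtering the names in the submodule. This is only a sketch: list_functions is a hypothetical helper written for this example, not part of scikit-learn.

```python
from sklearn import datasets


def list_functions(prefix):
    """Return the names in sklearn.datasets that start with `prefix` (hypothetical helper)."""
    return sorted(name for name in dir(datasets) if name.startswith(prefix))


loaders = list_functions("load_")      # packaged data
fetchers = list_functions("fetch_")    # downloadable data
generators = list_functions("make_")   # generated data

print(loaders)
```

The exact contents of each list depend on the installed scikit-learn version.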
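Structure of data and labels
Data in scikit-learn is in most cases stored as a two-dimensional NumPy array with the shape (n, m). Many algorithms also accept scipy.sparse matrices of the same shape.
- n (n_samples): the number of samples. Each sample is an item to process (e.g. classify). A sample can be a document, a picture, a sound, a video, an astronomical object, a row in a database or CSV file, or anything else you can describe with a fixed set of quantitative traits.
- m (n_features): the number of features, i.e. the distinct traits that can be used to describe each item quantitatively. Features are generally real-valued, but may be Boolean or discrete-valued in some cases.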
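The (n, m) layout can be checked directly on a packaged dataset. The sketch below assumes the iris dataset and LogisticRegression purely as examples (neither is named in the text above), and max_iter=1000 is an arbitrary value chosen so the solver converges.

```python
from scipy import sparse
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X, y = iris.data, iris.target

n_samples, n_features = X.shape   # (n, m)
print(n_samples, n_features)      # 150 4
print(y.shape)                    # (150,): one label per sample

# The same data as a scipy.sparse matrix keeps the (n, m) shape.
X_sparse = sparse.csr_matrix(X)
print(X_sparse.shape)             # (150, 4)

# LogisticRegression is one estimator that accepts either representation.
clf = LogisticRegression(max_iter=1000).fit(X_sparse, y)
predictions = clf.predict(X_sparse)
```

The labels form a one-dimensional array whose length matches n, so row i of the data corresponds to entry i of the labels.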
Import the submodule with:
    from sklearn import datasets
Be warned: many of the downloadable datasets are quite large and can take a long time to download!