Chapter Outline

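  • Scikit-learn provides a large collection of datasets for testing learning algorithms. They come in three types:

    • Packaged Data: these small datasets ship with the scikit-learn installation and can be loaded using the tools in sklearn.datasets.load_*.

    • Downloadable Data: these larger datasets are available for download, and scikit-learn includes tools that streamline the process. These tools can be found in sklearn.datasets.fetch_*.

    • Generated Data: several datasets are generated from models based on a random seed. These are available in sklearn.datasets.make_*.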
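
As a minimal sketch of the packaged and generated types (the downloadable fetch_* tools are left out because they need network access), the following loads the iris dataset bundled with scikit-learn and generates a small synthetic dataset. The choice of load_iris and make_blobs, and all parameter values, are illustrative assumptions rather than something the text prescribes.

```python
import numpy as np
from sklearn import datasets

# Packaged data: shipped with scikit-learn, so no download is needed.
iris = datasets.load_iris()
print(iris.data.shape)    # feature matrix
print(iris.target.shape)  # one label per sample

# Generated data: drawn from a model and controlled by a random seed.
X1, y1 = datasets.make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)
X2, y2 = datasets.make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)

# Using the same seed reproduces the same dataset.
print(np.array_equal(X1, X2))
```

Fixing random_state is what makes a generated dataset repeatable between runs; leaving it unset gives a different sample each time.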

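    You can explore the available dataset loaders, fetchers, and generators using IPython's tab completion. After importing the datasets submodule from sklearn, type

    datasets.load_<TAB>

    datasets.fetch_<TAB>

    datasets.make_<TAB>

    to see a list of the available functions.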
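
Outside an interactive IPython session, roughly the same listing can be produced by filtering the names in the datasets namespace by prefix. This is only a sketch built on Python's built-in dir(), not a dedicated scikit-learn feature, and the helper name names_with_prefix is hypothetical.

```python
from sklearn import datasets


def names_with_prefix(prefix):
    """Return the names in sklearn.datasets that start with `prefix` (hypothetical helper)."""
    return sorted(name for name in dir(datasets) if name.startswith(prefix))


loaders = names_with_prefix("load_")
fetchers = names_with_prefix("fetch_")
generators = names_with_prefix("make_")

# Print a short summary for each of the three families of functions.
for label, names in [("load_", loaders), ("fetch_", fetchers), ("make_", generators)]:
    print(label, len(names), "functions, e.g.", names[:3])
```

The exact lists depend on the installed scikit-learn version, so no expected output is shown here.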


    Structure of Data and Labels

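    Data in scikit-learn is in most cases stored as a two-dimensional NumPy array with shape (n, m). Many algorithms also accept scipy.sparse matrices of the same shape.

    • n (n_samples): the number of samples. Each sample is an item to process (e.g. classify). A sample can be a document, a picture, a sound, a video, an astronomical object, a row in a database or CSV file, or anything else you can describe with a fixed set of quantitative traits.

    • m (n_features): the number of features, i.e. the distinct traits that can be used to describe each item quantitatively. Features are generally real-valued, but may be Boolean or discrete-valued in some cases.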
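
The (n_samples, n_features) layout can be checked on a concrete dataset. The sketch below uses the packaged iris data purely as an assumed example, and converts it to a scipy.sparse CSR matrix only to show that the shape convention is the same for both representations.

```python
from scipy import sparse
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Rows are samples, columns are features.
n_samples, n_features = X.shape
print(n_samples, n_features)

# The label array holds one entry per sample.
print(y.shape)

# A sparse matrix built from the same data keeps the (n_samples, n_features) shape.
X_sparse = sparse.csr_matrix(X)
print(X_sparse.shape == X.shape)
```

A dense dataset like iris gains nothing from sparse storage; the format pays off when most feature values are zero, as is typical of word-count features for text.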

    Python
    from sklearn import datasets
    

    Be warned: many of the downloadable datasets are quite large and can take a long time to download!

