Chapter Outline

  • Introduction to Machine Learning: Data, Experience, and Evaluation

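    Machine learning is about adapting models to data. For this reason, we begin by showing how data can be represented so that a computer can understand it.

    At the beginning of this chapter we quoted Tom Mitchell's definition of machine learning: "Well posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E." Data is the "raw material" of machine learning: it is what the program learns from. In Mitchell's definition, "data" is hidden behind the terms "experience E" and "performance measure P". As mentioned earlier, we need labeled data to train and test our algorithm.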

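    However, before you begin training a classifier, we strongly recommend that you familiarize yourself with your data. NumPy offers ideal data structures for representing your data, and Matplotlib offers powerful tools for visualizing it.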
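    As a minimal sketch of what "representing data" can look like in NumPy, the block below stores a few samples as a 2-D array (one row per sample, one column per feature) with a separate 1-D array of labels. The values and the names X and y are made up for illustration; they are not taken from any dataset in this chapter.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 features each (values made up for illustration).
X = np.array([
    [5.0, 3.4],
    [4.8, 3.0],
    [6.4, 2.9],
    [6.9, 3.1],
])
# One integer class label per sample (0 and 1 are arbitrary class ids).
y = np.array([0, 0, 1, 1])

print(X.shape)         # (n_samples, n_features)
print(X.mean(axis=0))  # per-feature mean over all samples
print(X[y == 1])       # rows belonging to class 1, selected with a boolean mask
```

    Keeping features and labels in separate arrays of matching length is the layout that scikit-learn's bundled datasets also use.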

    In the following, we show how to do this using the data in the sklearn module.


    Iris Dataset: the "Hello World" of Machine Learning

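    What was the first program you saw? I bet it was a program that printed "Hello World" in some programming language, and most likely I'm right. Almost every introductory book or tutorial on programming starts with such a program. It's a tradition that goes back to the 1978 book "The C Programming Language" by Brian Kernighan and Dennis Ritchie!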

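    Similarly, the first dataset you see in an introductory tutorial on machine learning is very likely to be the "Iris dataset". The Iris dataset contains the measurements of 150 iris flowers from 3 different species: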

    • Iris-Setosa

    • Iris-Versicolor

    • Iris-Virginica
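    The figures above can be checked directly. This is a minimal sketch, assuming scikit-learn is installed and using its load_iris function; the variable name iris is our own choice.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the bundled Iris dataset; the result exposes data, target, and name arrays.
iris = load_iris()

print(iris.data.shape)            # number of samples and number of features
print(list(iris.feature_names))   # what each feature column measures
print(list(iris.target_names))    # the three species names
print(np.bincount(iris.target))   # number of samples per species
```

    The data array holds one row per flower, and target holds the species of each row as an integer index into target_names.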

    The Iris dataset is often used because of its simplicity. It is included in scikit-learn, but before taking a deeper look at it, we first look at the other datasets available in scikit-learn.
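    One quick way to get an overview is to list the loader functions in sklearn.datasets. This is only a sketch: it relies on the convention that the bundled datasets are exposed through functions whose names start with load_, and the exact list depends on the installed scikit-learn version.

```python
from sklearn import datasets

# Collect the names of all "load_*" functions exposed by sklearn.datasets.
loaders = sorted(name for name in dir(datasets) if name.startswith("load_"))

for name in loaders:
    print(name)
```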

