Step 3 — Organizing Data into Sets
To evaluate how well a classifier performs, you should always test the
model on data it did not see during training. Therefore, before building
a model, split your data into two parts: a training set and a test set.

You use the training set to train and evaluate the model during the
development stage. You then use the trained model to make predictions
on the unseen test set. Because the model never saw those examples, its
results on them give you a realistic sense of its performance and
robustness on new data.
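To see why the held-out test set matters, here is a minimal sketch using only the Python standard library. It is not part of this tutorial's scikit-learn workflow: the synthetic data, the 20% label noise, and the `predict` and `accuracy` helpers are all assumptions made up for illustration. The "model" is a 1-nearest-neighbour lookup that memorises its training points, so it looks perfect on the data it has seen and noticeably worse on data it has not.

```python
import random

rng = random.Random(0)

# Synthetic 1-D data (made up): the true label is 1 when x > 0.5,
# but 20% of labels are flipped to mimic noisy real-world data.
points = []
for _ in range(200):
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.2:
        label = 1 - label
    points.append((x, label))

# First 134 points for training, the remaining 66 held out as unseen data.
# The points were generated in random order, so slicing is enough here.
train, test = points[:134], points[134:]


def predict(x):
    """1-nearest-neighbour: return the label of the closest training point."""
    nearest = min(train, key=lambda p: abs(p[0] - x))
    return nearest[1]


def accuracy(dataset):
    """Fraction of (x, label) pairs whose predicted label matches."""
    correct = sum(predict(x) == label for x, label in dataset)
    return correct / len(dataset)


train_acc = accuracy(train)  # the model has memorised these points
test_acc = accuracy(test)    # an honest estimate on unseen points
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The training accuracy is perfect only because each training point is its own nearest neighbour. The lower test accuracy is the number that reflects how the model would behave on new data.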

Fortunately, scikit-learn (imported as sklearn) has a function called
train_test_split(), which divides your data into these sets. Import the
function and then use it to split the data:
ML Tutorial
...
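from sklearn.model_selection import train_test_split

# Hold out 33% of the rows as a test set; random_state seeds the shuffle
# so the same split is produced on every run.
train, test, train_labels, test_labels = train_test_split(
    features,
    labels,
    test_size=0.33,
    random_state=42,
)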
The function shuffles the data and splits it according to the test_size
parameter; random_state=42 seeds the shuffle so that the split is
reproducible. In this example, the test set (test) holds about 33% of
the original dataset, and the remaining data (train) makes up the
training set. We also have the matching labels for each set:
train_labels and test_labels.
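To make the mechanics concrete, the following is a rough standard-library sketch of what a seeded random split does. It is not scikit-learn's implementation: `simple_train_test_split` is a hypothetical helper, the toy `features` and `labels` are made up, and rounding the test count up is a choice made for this sketch.

```python
import math
import random


def simple_train_test_split(features, labels, test_size=0.33, seed=42):
    """Hypothetical helper: shuffle row indices, then slice off a test portion."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    n_samples = len(features)
    n_test = math.ceil(test_size * n_samples)  # rounding up is this sketch's choice

    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable

    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Index features and labels with the same positions so rows stay aligned.
    train = [features[i] for i in train_idx]
    test = [features[i] for i in test_idx]
    train_labels = [labels[i] for i in train_idx]
    test_labels = [labels[i] for i in test_idx]
    return train, test, train_labels, test_labels


# Toy data (made up for illustration): 100 rows, label is 1 for even row ids.
features = [[i, i * 2] for i in range(100)]
labels = [1 if i % 2 == 0 else 0 for i in range(100)]

train, test, train_labels, test_labels = simple_train_test_split(features, labels)
print(len(train), len(test))
```

With 100 toy rows and test_size=0.33, this yields 67 training rows and 33 test rows. Each row keeps its own label, and calling the helper again with the same seed returns the same split.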
We can now move on to training our first model.

Last modified: Wednesday, 25 June 2025, 11:38 AM