Python 机器学习: 2. Python 机器学习

- Python 学习生态系统

Python 简介

Python 是一种流行的面向对象编程语言，具有高级编程语言的能力。其易于学习的语法和可移植性使其在当今非常流行。以下事实为我们介绍了 Python：

Python 由 Guido van Rossum 在荷兰的 Stichting Mathematisch Centrum 开发。
它被写成名为“ABC”的编程语言的继任者。
它的第一个版本于 1991 年发布。
Python 这个名字是 Guido van Rossum 从一个名为 Monty Python's Flying Circus 的电视节目中选取的。
它是一种开源编程语言，这意味着我们可以免费下载它并用它来开发程序。它可以从 www.python.org 下载。
Python 编程语言兼具 Java 和 C 的特性。它拥有优雅的“C”代码，另一方面，它像 Java 一样拥有用于面向对象编程的类和对象。
它是一种解释型语言，这意味着 Python 程序的源代码将首先转换为字节码，然后由 Python 虚拟机执行。

Python 的优点和缺点

每种编程语言都有其优点和缺点，Python 也不例外。

优点

根据研究和调查，Python 是第五重要的语言，也是机器学习和数据科学中最流行的语言。这是因为 Python 具有以下优点：

易于学习和理解： Python 的语法更简单；因此，即使对于初学者来说，学习和理解该语言也相对容易。
多用途语言： Python 是一种多用途编程语言，因为它支持结构化编程、面向对象编程以及函数式编程。
大量模块： Python 拥有大量涵盖编程各个方面的模块。这些模块易于使用，从而使 Python 成为一种可扩展的语言。
开源社区支持： 作为开源编程语言，Python 得到了庞大开发社区的支持。因此，Python 社区可以轻松修复错误。这一特性使 Python 非常健壮和适应性强。
可扩展性： Python 是一种可扩展的编程语言，因为它提供了比 shell 脚本更好的结构来支持大型程序。

缺点

尽管 Python 是一种流行且功能强大的编程语言，但它也有执行速度慢的缺点。

与编译型语言相比，Python 的执行速度较慢，因为 Python 是一种解释型语言。这可能是 Python 社区需要改进的主要领域。

安装 Python

要在 Python 中工作，我们必须首先安装它。您可以通过以下两种方式之一安装 Python：

单独安装 Python
使用预打包的 Python 发行版：Anaconda

让我们详细讨论这些。

单独安装 Python

如果要在计算机上安装 Python，则只需下载适用于您平台的二进制代码。Python 发行版适用于 Windows、Linux 和 Mac 平台。

以下是上述平台上安装 Python 的快速概述：

在 Unix 和 Linux 平台上

通过以下步骤，我们可以在 Unix 和 Linux 平台上安装 Python：

首先，访问 https://www.python.org/downloads/。
接下来，单击链接下载适用于 Unix/Linux 的压缩源代码。
现在，下载并解压文件。
接下来，如果我们要自定义一些选项，可以编辑 Modules/Setup 文件。
1. 接下来，运行命令 ./configure script
2. make
3. make install

在 Windows 平台上

通过以下步骤，我们可以在 Windows 平台上安装 Python：

首先，访问 https://www.python.org/downloads/。
接下来，单击 Windows 安装程序 python-XYZ.msi 文件的链接。这里的 XYZ 是我们要安装的版本。
现在，我们必须运行下载的文件。它将带我们进入 Python 安装向导，该向导易于使用。现在，接受默认设置并等待安装完成。

在 Macintosh 平台上

对于 Mac OS X，建议使用 Homebrew（一个出色且易于使用的包安装程序）来安装 Python 3。如果您没有 Homebrew，可以使用以下命令安装它：

Bash

$ ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

可以使用以下命令进行更新：

Bash

$ brew update

现在，要在您的系统上安装 Python3，我们需要运行以下命令：

Bash

$ brew install python3

使用预打包的 Python 发行版：Anaconda

Anaconda 是 Python 的一个打包编译版本，其中包含数据科学中广泛使用的所有库。我们可以按照以下步骤使用 Anaconda 设置 Python 环境：

步骤 1：首先，我们需要从 Anaconda 发行版下载所需的安装包。链接是 https://www.anaconda.com/distribution/。您可以根据您的要求选择 Windows、Mac 和 Linux 操作系统。

步骤 2：接下来，选择要安装在机器上的 Python 版本。最新的 Python 版本是 3.7。您将获得 64 位和 32 位图形安装程序的选项。

步骤 3：选择操作系统和 Python 版本后，它将在您的计算机上下载 Anaconda 安装程序。现在，双击文件，安装程序将安装 Anaconda 包。

步骤 4：要检查它是否已安装，打开命令提示符并键入 Python，如下所示：

为什么选择 Python 进行数据科学？

Python 是第五重要的语言，也是机器学习和数据科学中最流行的语言。以下是 Python 的特性，使其成为数据科学的首选语言：

丰富的包集： Python 拥有一套广泛而强大的包，可用于各种领域。它还拥有像 NumPy、SciPy、Pandas、Scikit-learn 等机器学习和数据科学所需的包。
易于原型制作： Python 的另一个重要特性是其轻松快速的原型制作，使其成为数据科学的首选语言。此功能对于开发新算法很有用。
协作功能： 数据科学领域基本上需要良好的协作，而 Python 提供了许多有用的工具，使这一点变得非常容易。
一门语言适用于多个领域： 一个典型的数据科学项目包括数据提取、数据操作、数据分析、特征提取、建模、评估、部署和解决方案更新等各个领域。由于 Python 是一种多用途语言，它允许数据科学家从一个通用平台处理所有这些领域。

Python ML 生态系统的组件

在本节中，让我们讨论一些构成 Python 机器学习生态系统核心组件的数据科学库。这些有用的组件使 Python 成为数据科学的重要语言。尽管有许多这样的组件，但让我们在这里讨论一些 Python 生态系统中的重要组件：

Jupyter Notebook

Jupyter Notebooks 基本上提供了一个交互式计算环境，用于开发基于 Python 的数据科学应用程序。它们以前被称为 ipython notebooks。以下是 Jupyter Notebooks 的一些特性，使其成为 Python ML 生态系统最好的组件之一：

Jupyter Notebooks 可以通过逐步排列代码、图像、文本、输出等内容来逐步说明分析过程。
它帮助数据科学家在开发分析过程时记录思维过程。
还可以将结果捕获为笔记本的一部分。
借助 Jupyter Notebooks，我们还可以与同行共享我们的工作。

安装与执行

如果您使用的是 Anaconda 发行版，则无需单独安装 Jupyter Notebook，因为它已随附安装。您只需转到 Anaconda Prompt 并键入以下命令：

Bash

C:\>jupyter notebook

按下回车键后，它将在您计算机的 localhost:8888 启动一个笔记本服务器。

现在，在单击New选项卡之后，您将获得一个选项列表。选择Python 3，它将带您到新笔记本开始在其中工作。您将在以下截图中瞥见它：

另一方面，如果您使用的是标准 Python 发行版，则可以使用流行的 Python 包安装程序 pip 安装 Jupyter Notebook。

Bash

pip install jupyter

Jupyter Notebook 中的单元格类型

Jupyter Notebook 中有三种类型的单元格：

代码单元格： 顾名思义，我们可以使用这些单元格编写代码。编写代码/内容后，它会将其发送到与笔记本关联的内核。
Markdown 单元格： 我们可以使用这些单元格来注释计算过程。它们可以包含文本、图像、LaTeX 公式、HTML 标签等内容。
原始单元格： 其中写入的文本会原样显示。这些单元格主要用于添加我们不希望被 Jupyter Notebook 自动转换机制转换的文本。

有关 Jupyter Notebook 的更详细学习，您可以访问链接 https://www.tutorialspoint.com/jupyter/index.htm。

NumPy

它是另一个有用的组件，使 Python 成为数据科学最受欢迎的语言之一。它基本上代表 Numerical Python，由多维数组对象组成。通过使用 NumPy，我们可以执行以下重要操作：

对数组进行数学和逻辑运算。
傅里叶变换
与线性代数相关的操作。

我们还可以将 NumPy 视为 MatLab 的替代品，因为 NumPy 主要与 SciPy（科学 Python）和 Matplotlib（绘图库）一起使用。

安装与执行

如果您使用的是 Anaconda 发行版，则无需单独安装 NumPy，因为它已随附安装。您只需通过以下方式将包导入到您的 Python 脚本中：

Python

import numpy as np

另一方面，如果您使用的是标准 Python 发行版，则可以使用流行的 Python 包安装程序 pip 安装 NumPy。

Bash

pip install NumPy

安装 NumPy 后，您可以像上面那样将其导入到您的 Python 脚本中。

有关 NumPy 的更详细学习，您可以访问链接 https://www.tutorialspoint.com/numpy/index.htm。

Pandas

它是另一个有用的 Python 库，使 Python 成为数据科学最受欢迎的语言之一。Pandas 主要用于数据操作、整理和分析。它由 Wes McKinney 于 2008 年开发。借助 Pandas，在数据处理中我们可以完成以下五个步骤：

加载
准备
操作
建模
分析

Pandas 中的数据表示

Pandas 中数据的整个表示是通过以下三种数据结构完成的：

Series： 它基本上是一个带轴标签的一维 ndarray，这意味着它像一个简单的具有同构数据的一维数组。例如，以下 Series 是整数 1,5,10,15,24,25... 的集合。
```
1  5  10  15  24  25  28  36  40  89
```
DataFrame： 它是最有用的数据结构，用于 Pandas 中几乎所有类型的数据表示和操作。它基本上是一个二维数据结构，可以包含异构数据。通常，表格数据通过使用 DataFrames 表示。例如，下表显示了学生的姓名、学号、年龄和性别数据：

| Name | Roll number | Age | Gender |

| :------ | :---------- | :-- | :----- |

| Aarav | 1 | 15 | Male |

| Harshit | 2 | 14 | Male |

| Kanika | 3 | 16 | Female |

| Mayank | 4 | 15 | Male |
Panel： 它是一个三维数据结构，包含异构数据。很难在图形表示中表示 Panel，但可以将其说明为 DataFrame 的容器。

下表向我们展示了 Pandas 中使用的上述数据结构的维度和描述：

Data Structure	Dimension	Description
Series	1-D	大小不可变，一维同构数据
DataFrames	2-D	大小可变，表格形式的异构数据
Panel	3-D	大小可变数组，DataFrame 的容器

我们可以理解这些数据结构，即高维数据结构是低维数据结构的容器。

安装与执行

如果您使用的是 Anaconda 发行版，则无需单独安装 Pandas，因为它已随附安装。您只需通过以下方式将包导入到您的 Python 脚本中：

Python

import pandas as pd

另一方面，如果您使用的是标准 Python 发行版，则可以使用流行的 Python 包安装程序 pip 安装 Pandas。

Bash

pip install Pandas

安装 Pandas 后，您可以像上面那样将其导入到您的 Python 脚本中。

示例

以下是使用 Pandas 从 ndarray 创建 Series 的示例：

Python

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: data = np.array(['g','a','u','r','a','v'])
In [4]: s = pd.Series(data)
In [5]: print (s)
0    g
1    a
2    u
3    r
4    a
5    v
dtype: object

有关 Pandas 的更详细学习，您可以访问链接 https://www.tutorialspoint.com/python_pandas/index.htm。

Scikit-learn

Scikit-learn 是另一个用于数据科学和 Python 机器学习的有用且最重要的 Python 库。以下是 Scikit-learn 的一些特性，使其如此有用：

它建立在 NumPy、SciPy 和 Matplotlib 之上。
它是开源的，可以在 BSD 许可证下重复使用。
它对所有人开放，可以在各种上下文中重复使用。
借助它，可以实现广泛的机器学习算法，涵盖机器学习的主要领域，如分类、聚类、回归、降维、模型选择等。

安装与执行

如果您使用的是 Anaconda 发行版，则无需单独安装 Scikit-learn，因为它已随附安装。您只需在 Python 脚本中使用该包。例如，通过以下脚本行，我们正在从 Scikit-learn 导入乳腺癌患者数据集：

Python

from sklearn.datasets import load_breast_cancer

另一方面，如果您使用的是标准 Python 发行版并安装了 NumPy 和 SciPy，则可以使用流行的 Python 包安装程序 pip 安装 Scikit-learn。

Bash

pip install -U scikit-learn

安装 Scikit-learn 后，您可以像上面那样在 Python 脚本中使用它。

2. Machine Learning with Python Machine
– Python
Learning Ecosystem
with Python
An Introduction to Python
Python is a popular object-oriented programing language having the capabilities of high-
level programming language. Its easy to learn syntax and portability capability makes it
popular these days. The followings facts gives us the introduction to Python:

Python was developed by Guido van Rossum at Stichting Mathematisch Centrum in
the Netherlands.


It was written as the successor of programming language named ‘ABC’.
It’s first version was released in 1991.

The name Python was picked by Guido van Rossum from a TV show named Monty
Python’s Flying Circus.

It is an open source programming language which means that we can freely
download it and use it to develop programs. It can be downloaded from
www.python.org.

Python programming language is having the features of Java and C both. It is
having the elegant ‘C’ code and on the other hand, it is having classes and objects
like Java for object-oriented programming.

It is an interpreted language, which means the source code of Python program
would be first converted into bytecode and then executed by Python virtual
machine.
Strengths and Weaknesses of Python
Every programming language has some strengths as well as weaknesses, so does Python
too.
Strengths
According to studies and surveys, Python is the fifth most important language as well as
the most popular language for machine learning and data science. It is because of the
following strengths that Python has:
Easy to learn and understand: The syntax of Python is simpler; hence it is relatively
easy, even for beginners also, to learn and understand the language.
Multi-purpose language: Python is a multi-purpose programming language because it
supports structured programming, object-oriented programming as well as functional
programming.
6
Machine Learning with Python
Huge number of modules: Python has huge number of modules for covering every
aspect of programming. These modules are easily available for use hence making Python
an extensible language.
Support of open source community: As being open source programming language,
Python is supported by a very large developer community. Due to this, the bugs are easily
fixed by the Python community. This characteristic makes Python very robust and
adaptive.
Scalability: Python is a scalable programming language because it provides an improved
structure for supporting large programs than shell-scripts.
Weakness
Although Python is a popular and powerful programming language, it has its own weakness
of slow execution speed.
The execution speed of Python is slow as compared to compiled languages because Python
is an interpreted language. This can be the major area of improvement for Python
community.
Installing Python
For working in Python, we must first have to install it. You can perform the installation of
Python in any of the following two ways:

Installing Python individually

Using Pre-packaged Python distribution: Anaconda
Let us discuss these each in detail.
Installing Python Individually
If you want to install Python on your computer, then then you need to download only the
binary code applicable for your platform. Python distribution is available for Windows,
Linux and Mac platforms.
The following is a quick overview of installing Python on the above-mentioned platforms:
On Unix and Linux platform
With the help of following steps, we can install Python on Unix and Linux platform:

First, go to https://www.python.org/downloads/.

Next, click on the link to download zipped source code available for Unix/Linux.

Now, Download and extract files.

Next, we can edit the Modules/Setup file if we want to customize some options.
1. Next, write the command run ./configure script
2. make
3. make install
7
Machine Learning with Python
On Windows platform
With the help of following steps, we can install Python on Windows platform:

First, go to https://www.python.org/downloads/.

Next, click on the link for Windows installer python-XYZ.msi file. Here XYZ is the
version we wish to install.

Now, we must run the file that is downloaded. It will take us to the Python install
wizard, which is easy to use. Now, accept the default settings and wait until the
install is finished.
On Macintosh platform
For Mac OS X, Homebrew, a great and easy to use package installer is recommended to
install Python 3. In case if you don't have Homebrew, you can install it with the help of
following command:
$ ruby -e "$(curl -fsSL
https://raw.githubusercontent.com/Homebrew/install/master/install)"
It can be updated with the command below:
$ brew update
Now, to install Python3 on your system, we need to run the following command:
$ brew install python3
Using Pre-packaged Python Distribution: Anaconda
Anaconda is a packaged compilation of Python which have all the libraries widely used in
Data science. We can follow the following steps to setup Python environment using
Anaconda:
Step1: First, we need to download the required installation package from Anaconda
distribution. The link for the same is https://www.anaconda.com/distribution/. You can
choose from Windows, Mac and Linux OS as per your requirement.
Step2: Next, select the Python version you want to install on your machine. The latest
Python version is 3.7. There you will get the options for 64-bit and 32-bit Graphical installer
both.
Step3: After selecting the OS and Python version, it will download the Anaconda installer
on your computer. Now, double click the file and the installer will install Anaconda package.
Step4: For checking whether it is installed or not, open a command prompt and type
Python as follows:
8
Machine Learning with Python
You
can
also
check
this
in
detailed
video
lecture
at
https://www.tutorialspoint.com/python_essentials_online_training/getting_started_with_
anaconda.asp.
Why Python for Data Science?
Python is the fifth most important language as well as most popular language for Machine
learning and data science. The following are the features of Python that makes it the
preferred choice of language for data science:
Extensive set of packages
Python has an extensive and powerful set of packages which are ready to be used in
various domains. It also has packages like numpy, scipy, pandas, scikit-learn etc.
which are required for machine learning and data science.
Easy prototyping
Another important feature of Python that makes it the choice of language for data science
is the easy and fast prototyping. This feature is useful for developing new algorithm.
Collaboration feature
The field of data science basically needs good collaboration and Python provides many
useful tools that make this extremely.
One language for many domains
A typical data science project includes various domains like data extraction, data
manipulation, data analysis, feature extraction, modelling, evaluation, deployment and
updating the solution. As Python is a multi-purpose language, it allows the data scientist
to address all these domains from a common platform.
9
Machine Learning with Python
Components of Python ML Ecosystem
In this section, let us discuss some core Data Science libraries that form the components
of Python Machine learning ecosystem. These useful components make Python an
important language for Data Science. Though there are many such components, let us
discuss some of the importance components of Python ecosystem here:
Jupyter Notebook
Jupyter notebooks basically provides an interactive computational environment for
developing Python based Data Science applications. They are formerly known as ipython
notebooks. The following are some of the features of Jupyter notebooks that makes it one
of the best components of Python ML ecosystem:

Jupyter notebooks can illustrate the analysis process step by step by arranging the
stuff like code, images, text, output etc. in a step by step manner.



It helps a data scientist to document the thought process while developing the
analysis process.
One can also capture the result as the part of the notebook.
With the help of jupyter notebooks, we can share our work with a peer also.
Installation and Execution
If you are using Anaconda distribution, then you need not install jupyter notebook
separately as it is already installed with it. You just need to go to Anaconda Prompt and
type the following command:
C:\>jupyter notebook
10
Machine Learning with Python
After pressing enter, it will start a notebook server at localhost:8888 of your computer. It is
shown in the following screen shot:
Now, after clicking the New tab, you will get a list of options. Select Python 3 and it will
take you to the new notebook for start working in it. You will get a glimpse of it in the
following screenshots:
11
Machine Learning with Python
On the other hand, if you are using standard Python distribution then jupyter notebook
can be installed using popular python package installer, pip.
pip install jupyter
Types of Cells in Jupyter Notebook
The following are the three types of cells in a jupyter notebook:
Code cells: As the name suggests, we can use these cells to write code. After writing the
code/content, it will send it to the kernel that is associated with the notebook.
Markdown cells: We can use these cells for notating the computation process. They can
contain the stuff like text, images, Latex equations, HTML tags etc.
Raw cells: The text written in them is displayed as it is. These cells are basically used to
add the text that we do not wish to be converted by the automatic conversion mechanism
of jupyter notebook.
For
more
detailed
study
of
jupyter
notebook,
you
can
go
to
the
link
https://www.tutorialspoint.com/jupyter/index.htm.
NumPy
It is another useful component that makes Python as one of the favorite languages for
Data Science. It basically stands for Numerical Python and consists of multidimensional
array objects. By using NumPy, we can perform the following important operations:


Mathematical and logical operations on arrays.
Fourier transformation
12
Machine Learning with Python

Operations associated with linear algebra.
We can also see NumPy as the replacement of MatLab because NumPy is mostly used along
with Scipy (Scientific Python) and Mat-plotlib (plotting library).
Installation and Execution
If you are using Anaconda distribution, then no need to install NumPy separately as it is
already installed with it. You just need to import the package into your Python script with
the help of following:
import numpy as np
On the other hand, if you are using standard Python distribution then NumPy can be installed
using popular python package installer, pip.
pip install NumPy
After installing NumPy, you can import it into your Python script as you did above.
For
more
detailed
study
of
NumPy,
you
can
go
to
the
link
https://www.tutorialspoint.com/numpy/index.htm.
Pandas
It is another useful Python library that makes Python one of the favorite languages for
Data Science. Pandas is basically used for data manipulation, wrangling and analysis. It
was developed by Wes McKinney in 2008. With the help of Pandas, in data processing we
can accomplish the following five steps:

Load

Prepare

Manipulate

Model

Analyze
Data representation in Pandas
The entire representation of data in Pandas is done with the help of following three data
structures:
Series: It is basically a one-dimensional ndarray with an axis label which means it is like a
simple array with homogeneous data. For example, the following series is a collection of
integers 1,5,10,15,24,25...
1
5
10
15
24
25
28
36
40
89
Data frame: It is the most useful data structure and used for almost all kind of data
representation and manipulation in pandas. It is basically a two-dimensional data structure
which can contain heterogeneous data. Generally, tabular data is represented by using
13
Machine Learning with Python
data frames. For example, the following table shows the data of students having their
names and roll numbers, age and gender:
Name
Roll number
Age
Gender
Aarav
1
15
Male
Harshit
2
14
Male
Kanika
3
16
Female
Mayank
4
15
Male
Panel: It is a 3-dimensional data structure containing heterogeneous data. It is very
difficult to represent the panel in graphical representation, but it can be illustrated as a
container of DataFrame.
The following table gives us the dimension and description about above mentioned data
structures used in Pandas:
Data Structure
Dimension
Description
Series
1-D
Size
immutable,
1-D
homogeneous data
DataFrames
2-D
Size
Mutable,
Heterogeneous data in
tabular form
Panel
3-D
Size-mutable
array,
container
of
DataFrame.
We can understand these data structures as the higher dimensional data structure is the
container of lower dimensional data structure.
Installation and Execution
If you are using Anaconda distribution, then no need to install Pandas separately as it is
already installed with it. You just need to import the package into your Python script with
the help of following:
import pandas as pd
On the other hand, if you are using standard Python distribution then Pandas can be installed
using popular python package installer, pip.
pip install Pandas
After installing Pandas, you can import it into your Python script as did above.
14
Machine Learning with Python
Example
The following is an example of creating a series from ndarray by using Pandas:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: data = np.array(['g','a','u','r','a','v'])
In [4]: s = pd.Series(data)
In [5]: print (s)
0
g
1
a
2
u
3
r
4
a
5
v
dtype: object
For
more
detailed
study
of
Pandas
you
can
go
to
the
link
https://www.tutorialspoint.com/python_pandas/index.htm.
Scikit-learn
Another useful and most important python library for Data Science and machine learning
in Python is Scikit-learn. The following are some features of Scikit-learn that makes it so useful:

It is built on NumPy, SciPy, and Matplotlib.

It is an open source and can be reused under BSD license.

It is accessible to everybody and can be reused in various contexts.

Wide range of machine learning algorithms covering major areas of ML like
classification, clustering, regression, dimensionality reduction, model selection etc.
can be implemented with the help of it.
Installation and Execution
If you are using Anaconda distribution, then no need to install Scikit-learn separately as it is
already installed with it. You just need to use the package into your Python script. For
example, with following line of script we are importing dataset of breast cancer patients
from Scikit-learn:
15
Machine Learning with Python
from sklearn.datasets import load_breast_cancer
On the other hand, if you are using standard Python distribution and having NumPy and
SciPy then Scikit-learn can be installed using popular python package installer, pip.
pip install -U scikit-learn
After installing Scikit-learn, you can use it into your Python script as you have done above.

Last modified: Thursday, 26 June 2025, 9:25 AM