Python Programming: Strings | The Way To Learn

Overview

Strings in Python at a glance:

str1 = "Hello"                # A new string using double quotes
str2 = 'Hello'                # Single quotes do the same
str3 = "Hello\tworld\n"       # One with a tab and a newline
str4 = str1 + " world"        # Concatenation
str5 = str1 + str(4)          # Concatenation with a number
str6 = str1[2]                # Third character
str6a = str1[-1]              # Last character
#str1[0] = "M"                # Not allowed; strings are immutable
for char in str1: print(char) # For each character
str7 = str1[1:]               # Without the first character
str8 = str1[:-1]              # Without the last character
str9 = str1[1:4]              # Substring: 2nd to 4th character
str10 = str1 * 3              # Repetition
str11 = str1.lower()          # Lowercase
str12 = str1.upper()          # Uppercase
str13 = str1.rstrip()         # Strip trailing (right-hand) whitespace
str14 = str1.replace('l','h') # Replacement
list15 = str1.split('l')      # Splitting
if str1 == str2: print("Equ") # Equality test
if "el" in str1: print("In")  # Substring test
length = len(str1)            # Length
pos1 = str1.find('llo')       # Index of substring, or -1 if not found
pos2 = str1.rfind('l')        # Index of substring, searching from the right
count = str1.count('l')       # Number of occurrences of a substring

print(str1, str2, str3, str4, str5, str6, str7, str8, str9, str10)
print(str11, str12, str13, str14, list15)
print(length, pos1, pos2, count)

For advanced pattern matching on Python strings, see the Regular Expression chapter.

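String operations

Equality

Two strings are equal if they have exactly the same contents, meaning that they have the same length and the same character at every position. Many other languages compare strings by identity instead; that is, two strings are considered equal only if they occupy the same space in memory. Python uses the is operator to test the identity of strings, and of any two objects in general.

Examples:

>>> a = 'hello'; b = 'hello'  # Assign 'hello' to a and b
>>> a == b                    # Check for equality
True
>>> a == 'hello'
True
>>> a == "hello"              # (the choice of delimiter is unimportant)
True
>>> a == 'hello '             # (extra space)
False
>>> a == 'Hello'              # (wrong case)
False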
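
To make the difference between equality and identity concrete, here is a minimal sketch. The variable names are illustrative only. Whether two equal strings happen to share one object is an implementation detail (CPython may reuse string objects), so the only identity result that can be relied on here is the one for the alias.

```python
# Equality (==) compares contents; identity (is) compares objects.
a = 'hello'
b = ''.join(['hel', 'lo'])   # same contents, assembled at run time
alias = a                    # a second name for the very same object

contents_equal = (a == b)        # True: same characters in the same order
alias_identical = (alias is a)   # True: both names refer to one object

print(contents_equal, alias_identical)
print(a is b)   # implementation-dependent; typically False in CPython
```

For comparing string contents, == is the operator to use; is answers a different question.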

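Numeric operations

Two quasi-numeric operations can be performed on strings: addition and multiplication. String addition is just another name for concatenation, and string multiplication is repeated concatenation. So: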

>>> c = 'a'
>>> c + 'b'
'ab'
>>> c * 5
'aaaaa'
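
As a small illustration of both operations together, the sketch below underlines a heading. The names title, underline, banner and page are made up for this example. Note that + only joins a string to another string, so a number has to be converted with str() first.

```python
title = 'Report'
underline = '-' * len(title)        # repeat '-' once per character of the title
banner = title + '\n' + underline   # concatenate, with a newline in between
print(banner)

page = 'Page ' + str(3)             # 'Page ' + 3 would raise a TypeError
print(page)
```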

Containment

There is a simple operator, in, that returns True if the first operand is contained in the second. It also works on substrings:

>>> x = 'hello'
>>> y = 'ell'
>>> x in y
False
>>> y in x
True

Note that print(x in y) prints the same value.
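
The containment test is case-sensitive. The following sketch, with illustrative variable names, shows one common way to ignore case (lowercasing the text before testing) along with two edge cases.

```python
text = 'Hello, World'

exact = 'world' in text                   # False: 'w' and 'W' are different characters
ignoring_case = 'world' in text.lower()   # True once the text is lowercased
empty = '' in text                        # True: the empty string is in every string
absent = 'xyz' not in text                # True: 'not in' is the negated test

print(exact, ignoring_case, empty, absent)
```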

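Indexing and slicing

Much like arrays in other languages, the individual characters in a string can be accessed by an integer representing their position in the string. The first character of a string s is s[0] and the nth character is s[n-1].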

>>> s = "Xanadu"
>>> s[1]
'a'

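Unlike arrays in many other languages, Python strings can also be indexed backwards from the end using negative numbers. The last character has index -1, the second to last has index -2, and so on.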

>>> s[-4]
'n'

We can also use slices to access a substring of s. s[a:b] gives a string that starts with s[a] and ends with s[b-1].

>>> s[1:4]
'ana'

None of these operations can be used for assignment, because strings are immutable.

>>> print(s)
Xanadu
>>> s[0] = 'J'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> s[1:3] = "up"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> print(s)
Xanadu

Both assignments fail, so s still prints as Xanadu afterwards.
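
Since a string cannot be changed in place, the usual approach is to build a new string from slices of the old one. This is a minimal sketch; the names t and u are illustrative.

```python
s = "Xanadu"

t = 'J' + s[1:]            # new string with the first character replaced
u = s[:1] + "up" + s[3:]   # new string with the slice s[1:3] replaced

print(t)   # Janadu
print(u)   # Xupadu
print(s)   # Xanadu (the original is untouched)
```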

Another feature of slices is that if the start or the end is left out, it defaults to the first or last index, depending on context:

>>> s[2:]
'nadu'
>>> s[:3]
'Xan'
>>> s[:]
'Xanadu'

You can also use negative numbers in slices:

>>> print(s[-2:])
du

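To understand slices, it is easiest not to count the elements themselves. It is a bit like counting not on your fingers but in the spaces between them. A list is indexed like this:

Element:     1     2     3     4
Index:    0     1     2     3     4
         -4    -3    -2    -1

So when we ask for the slice [1:3], we start at boundary 1, end at boundary 3, and take everything in between, that is, elements 2 and 3 in the diagram (the items at indexes 1 and 2). If you are used to indexing in C or Java, this can be a bit disconcerting until you get used to it.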
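
A few consequences of this "boundaries" view can be checked directly. The sketch below reuses the string "Xanadu" from the earlier examples; the variable names are illustrative.

```python
s = "Xanadu"

middle = s[1:3]          # the characters between boundaries 1 and 3
print(middle)            # an
print(len(s[1:4]))       # 3, i.e. 4 - 1: a slice [a:b] has b - a characters

rejoined = s[:2] + s[2:]
print(rejoined == s)     # True: cutting at any boundary loses nothing

clipped = s[2:100]       # out-of-range slice bounds are clipped, not an error
print(clipped)           # nadu
```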

String constants

String constants can be found in the standard string module. One example is string.digits, which equals '0123456789'.

Links:

Python documentation of the "string" module
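
As a short illustration of how these constants can be used, the sketch below checks whether a string consists only of ASCII digits. The helper only_digits is hypothetical and written just for this example.

```python
import string

print(string.digits)            # 0123456789
print(string.ascii_lowercase)   # abcdefghijklmnopqrstuvwxyz

def only_digits(text):
    """Return True if text is non-empty and every character is in string.digits."""
    return len(text) > 0 and all(ch in string.digits for ch in text)

print(only_digits('2024'))   # True
print(only_digits('20a4'))   # False
```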

String methods

Python provides a number of string methods and built-in string functions:

capitalize
center
count
decode
encode
endswith
expandtabs
find
index
isalnum
isalpha
isdigit
islower
isspace
istitle
isupper
join
ljust
lower
lstrip
replace
rfind
rindex
rjust
rstrip
split
splitlines
startswith
strip
swapcase
title
translate
upper
zfill

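Only a selection of these methods is covered below. The list dates from Python 2: in Python 3, decode is a method of bytes, not of str.

is*

isalnum(), isalpha(), isdigit(), islower(), isupper(), isspace() and istitle() fall into this category.

The string being tested must have a length of at least 1, otherwise the is* methods return False. In other words, a string of length 0 is considered "empty", and every is* test on it is False.

isalnum: returns True if the string is composed entirely of alphabetic and/or numeric characters (i.e. no punctuation).
isalpha and isdigit: work similarly, for alphabetic characters only or numeric characters only.
isspace: returns True if the string is composed entirely of whitespace.
islower, isupper and istitle: return True if the string is in lowercase, uppercase or titlecase respectively. Uncased characters (such as digits) are "allowed", but there must be at least one cased character in the string for True to be returned. Titlecase means that the first cased character of each word is uppercase and any cased characters immediately following it are lowercase. Curiously, 'Y2K'.istitle() returns True. That is because uppercase characters may only follow uncased characters, and lowercase characters may only follow cased (uppercase or lowercase) characters. Hint: whitespace is uncased.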

Examples:

>>> '2YK'.istitle()
False
>>> 'Y2K'.istitle()
True
>>> '2Y K'.istitle()
True
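
The rules above, including the special case of the empty string, can be tried out with a few short calls. This is only an illustrative sketch; the sample strings are made up.

```python
# Every is* test is False for the empty string.
empty_results = [''.isalnum(), ''.isalpha(), ''.isdigit(), ''.isspace()]
print(empty_results)           # [False, False, False, False]

print('abc123'.isalnum())      # True
print('abc 123'.isalnum())     # False: the space is neither a letter nor a digit
print('abc1'.islower())        # True: the digit is uncased and does not matter
print('123'.islower())         # False: there is no cased character at all
print(' \t\n'.isspace())       # True
```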

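Title, Upper, Lower, Swapcase, Capitalize

These methods return the string converted to title case, upper case, lower case, inverted case, or capitalized form, respectively.

The title method capitalizes the first letter of each word in the string and turns all other letters to lower case. A word is a substring of alphabetic characters delimited by non-alphabetic characters such as digits or whitespace. This can lead to unexpected behavior: for example, the string "x1x" is converted to "X1X" rather than "X1x".
The swapcase method turns every uppercase letter into lowercase and vice versa.
The capitalize method is like title except that it treats the whole string as a single word (that is, it upper-cases the first character and lower-cases the rest).

Example: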

s = 'Hello, wOrLD'
print(s)             # 'Hello, wOrLD'
print(s.title())     # 'Hello, World'
print(s.swapcase())  # 'hELLO, WoRld'
print(s.upper())     # 'HELLO, WORLD'
print(s.lower())     # 'hello, world'
print(s.capitalize())# 'Hello, world'
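
Because title treats every non-alphabetic character as a word boundary, apostrophes also split words. The sketch below shows the effect and one possible alternative, string.capwords from the standard library, which splits on whitespace only. The sample phrase is made up.

```python
import string

phrase = "they're bill's friends"

titled = phrase.title()             # the apostrophe starts a new "word"
capped = string.capwords(phrase)    # splits on whitespace, then capitalizes each piece

print(titled)   # They'Re Bill'S Friends
print(capped)   # They're Bill's Friends
```

Note that string.capwords also collapses runs of whitespace into single spaces, so it is not a drop-in replacement in every situation.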

count

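Returns the number of occurrences of the specified substring in the string. For example:

>>> s = 'Hello, world'
>>> s.count('o')  # number of 'o' characters in 'Hello, world' (2)
2

Hint: .count() is case-sensitive, so this example only counts the lowercase letter 'o'. For example, if you ran:

>>> s = 'HELLO, WORLD'
>>> s.count('o')  # number of lowercase 'o' characters in 'HELLO, WORLD' (0)
0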
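
Two further details are worth a quick sketch: count only counts non-overlapping matches, and it accepts an optional start index. A case-insensitive count can be obtained by lowercasing first. The variable names are illustrative.

```python
overlapping = 'aaaa'.count('aa')                # 2, not 3: matches do not overlap
any_case = 'HELLO, WORLD'.lower().count('o')    # 2: lowercase first to ignore case
from_index = 'Hello, world'.count('o', 5)       # 1: only searches from index 5 onwards

print(overlapping, any_case, from_index)
```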

strip, rstrip, lstrip

Returns a copy of the string with leading (lstrip) or trailing (rstrip) whitespace removed. strip removes whitespace at both ends.

>>> s = '\t Hello, world\n\t '
>>> print(s)
         Hello, world

>>> print(s.strip())
Hello, world
>>> print(s.lstrip())
Hello, world
        # ends here
>>> print(s.rstrip())
         Hello, world

Note the leading and trailing tabs and newlines.

The strip methods can also be used to remove other types of characters.

import string
s = 'www.wikibooks.org'
print(s)
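print(s.strip('w'))                      # Remove every 'w' from both ends
print(s.strip(string.ascii_lowercase))   # Remove all lowercase letters from both ends
print(s.strip(string.printable))         # Remove all printable characters from both ends, leaving ''

Output: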

www.wikibooks.org
.wikibooks.org
.wikibooks.

Note that string.ascii_lowercase and string.printable require the string module to be imported.
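
The argument to strip, lstrip and rstrip is treated as a set of characters, not as a prefix or suffix. The sketch below shows the resulting surprise and one way around it; drop_prefix is a hypothetical helper written for this example (Python 3.9 and later also provide str.removeprefix for the same purpose).

```python
url = 'www.wikibooks.org'

surprising = url.lstrip('www.')   # strips every leading 'w' and '.', one character at a time
print(surprising)                 # ikibooks.org

def drop_prefix(text, prefix):
    """Remove prefix from the start of text if it is present (hypothetical helper)."""
    return text[len(prefix):] if text.startswith(prefix) else text

intended = drop_prefix(url, 'www.')
print(intended)                   # wikibooks.org
```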

ljust, rjust, center

Left-, right- or center-justifies a string in a field of the given width, padding the remainder with spaces.

>>> s = 'foo'
>>> s
'foo'
>>> s.ljust(7)
'foo    '
>>> s.rjust(7)
'    foo'
>>> s.center(7)
'  foo  '
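
These methods also accept an optional fill character as a second argument, and they never shorten a string that is already wider than the field. The sketch below shows both, plus a small column layout; the rows data is made up for illustration.

```python
s = 'foo'
starred = s.center(7, '*')   # pad with '*' instead of spaces
dotted = s.rjust(7, '.')
too_narrow = s.ljust(2)      # a width shorter than the string changes nothing

print(starred)      # **foo**
print(dotted)       # ....foo
print(too_narrow)   # foo

rows = [('apple', 3), ('kiwi', 12)]   # made-up sample data
lines = [name.ljust(8) + str(qty).rjust(4) for name, qty in rows]
print('\n'.join(lines))               # names left-aligned, quantities right-aligned
```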

join

Joins the given sequence together, using the string as the separator:

>>> seq = ['1', '2', '3', '4', '5']
>>> ' '.join(seq)
'1 2 3 4 5'
>>> '+'.join(seq)
'1+2+3+4+5'

map may be helpful here (it converts the numbers in seq into strings):

>>> seq = [1, 2, 3, 4, 5]
>>> ' '.join(map(str, seq))
'1 2 3 4 5'

Now seq may contain arbitrary objects rather than just strings.

find, index, rfind, rindex

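The find and index methods return the index of the first occurrence of the given subsequence. If it is not found, find returns -1, whereas index raises a ValueError. rfind and rindex are the same as find and index except that they search the string from right to left (i.e. they find the index of the last occurrence).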

>>> s = 'Hello, world'
>>> s.find('l')
2
>>> s[s.index('l'):]
'llo, world'
>>> s.rfind('l')
10
>>> s[:s.rindex('l')]
'Hello, wor'
>>> s[s.index('l'):s.rindex('l')]
'llo, wor'

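Because Python strings accept negative subscripts, index is probably the better choice in situations like the one shown: find returns -1 when nothing is found, and used as a subscript, -1 silently refers to the last character instead of signalling an error.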
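
The pitfall can be reproduced with a substring that does not occur. This is a minimal sketch using the same string as above; the names pos, tail and missing are illustrative.

```python
s = 'Hello, world'

pos = s.find('z')    # -1, because 'z' does not occur
tail = s[pos:]       # -1 is a valid subscript: this is the last character
print(pos, repr(tail))   # -1 'd'

try:
    s.index('z')     # index refuses to return a misleading value
    missing = False
except ValueError:
    missing = True
print(missing)       # True
```

If find is used, the result should be compared with -1 before it is used as a subscript.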

replace

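replace works just as its name suggests. It returns a copy of the string with all occurrences of the first parameter replaced by the second parameter.

>>> 'Hello, world'.replace('o', 'X')
'HellX, wXrld'

Or, using variable assignment:

string = 'Hello, world'
newString = string.replace('o', 'X')
print(string)
print(newString)

Output:

Hello, world
HellX, wXrld

Notice that the original variable (string) remains unchanged after the call to replace.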
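
replace also takes an optional third argument that limits how many occurrences are replaced. The sketch below shows that, along with deleting a character and the no-match case; the variable names are illustrative.

```python
s = 'Hello, world'

first_only = s.replace('o', 'X', 1)   # replace at most one occurrence
deleted = s.replace('l', '')          # replacing with '' removes the character
unchanged = s.replace('z', 'X')       # no match: the result equals the original

print(first_only)   # HellX, world
print(deleted)      # Heo, word
print(unchanged)    # Hello, world
```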

expandtabs

Replaces tabs with the appropriate number of spaces (the default tab size is 8 columns; this can be changed by passing the tab size as an argument).

s = 'abcdefg\tabc\ta'
print(s)
print(len(s))
t = s.expandtabs()
print(t)
print(len(t))

Output:

abcdefg abc     a
13
abcdefg abc     a
17

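Notice that although the two strings look the same, the second string (t) has a different length, because each tab has been replaced by spaces rather than being a tab character.

To use a tab size of 4 instead of 8: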

v = s.expandtabs(4)
print(v)
print(len(v))

Output:

abcdefg abc a
13

Note that a tab does not always count as eight spaces. Rather, a tab "pushes" the column count to the next multiple of eight. For example:

s = '\t\t'
print(s.expandtabs().replace(' ', '*'))
print(len(s.expandtabs()))

Output:

****************
16

s = 'abc\tabc\tabc'
print(s.expandtabs().replace(' ', '*'))
print(len(s.expandtabs()))

Output:

abc*****abc*****abc
19
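
The "push to the next multiple" rule can be written out by hand. The function expand_tabs below is a hypothetical re-implementation for illustration only, assuming single-line text and a positive tab size; it is not how Python implements the method, but it gives the same results for the examples in this section.

```python
def expand_tabs(text, tabsize=8):
    """Illustrative re-implementation of str.expandtabs for single-line text."""
    out = []
    column = 0
    for ch in text:
        if ch == '\t':
            pad = tabsize - (column % tabsize)   # spaces needed to reach the next tab stop
            out.append(' ' * pad)
            column += pad
        else:
            out.append(ch)
            column += 1
    return ''.join(out)

# Compare with the built-in method on the strings used above.
for sample in ['abcdefg\tabc\ta', '\t\t', 'abc\tabc\tabc']:
    assert expand_tabs(sample) == sample.expandtabs()
    assert expand_tabs(sample, 4) == sample.expandtabs(4)

print(len(expand_tabs('abc\tabc\tabc')))   # 19
```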

split, splitlines

The split method returns a list of the words in the string. It can take a separator argument to use instead of whitespace.

>>> s = 'Hello, world'
>>> s.split()
['Hello,', 'world']
>>> s.split('l')
['He', '', 'o, wor', 'd']

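Note that in neither case is the separator included in the split strings, but empty strings are allowed.

The splitlines method breaks a multiline string into many single-line strings. It is analogous to split('\n') (but also accepts '\r' and '\r\n' as delimiters), except that if the string ends in a newline character, splitlines ignores that final newline (see the example).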

>>> s = """
... One line
... Two lines
... Red lines
... Blue lines
... Green lines
... """
>>> s.split('\n')
['', 'One line', 'Two lines', 'Red lines', 'Blue lines', 'Green lines', '']
>>> s.splitlines()
['', 'One line', 'Two lines', 'Red lines', 'Blue lines', 'Green lines']

The split method also accepts multi-character string literals:

txt = 'May the force be with you'
spl = txt.split('the')
print(spl)
# ['May ', ' force be with you']
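
A few more behaviors of split and splitlines are easy to overlook. The sketch below is illustrative only, with made-up sample strings: it contrasts the default whitespace splitting with an explicit ' ' separator, shows the optional maximum number of splits, and shows keepends for splitlines.

```python
by_whitespace = 'a  b'.split()       # runs of whitespace count as one separator
by_space = 'a  b'.split(' ')         # an explicit separator yields empty strings
limited = 'a,b,c'.split(',', 1)      # split at most once
kept = 'x\ny\n'.splitlines(keepends=True)        # keep the line endings
normalized = ' '.join('Hello,   world'.split())  # collapse repeated spaces

print(by_whitespace)   # ['a', 'b']
print(by_space)        # ['a', '', 'b']
print(limited)         # ['a', 'b,c']
print(kept)            # ['x\n', 'y\n']
print(normalized)      # Hello, world
```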

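Unicode

In Python 3.x, all strings (the type str) contain Unicode by default.

In Python 2.x, there is a dedicated unicode type in addition to the str type: for example, after u = u"Hello", type(u) is unicode.

In the built-in help, the relevant topic is named UNICODE.

Examples for Python 3.x:

import tokenize
import unicodedata

v = "Hello Günther"
# Uses a Unicode character directly in the source code; the source file has to be UTF-8 encoded
v = "Hello G\xfcnther"
# Uses \xfc to specify an 8-bit Unicode code point
v = "Hello G\u00fcnther"
# Uses \u00fc to specify a 16-bit Unicode code point
v = "Hello G\U000000fcnther"
# Uses \U000000fc to specify a 32-bit Unicode code point; note the capital U
v = "Hello G\N{LATIN SMALL LETTER U WITH DIAERESIS}nther"
# Uses \N followed by the name of the Unicode code point
v = "Hello G\N{latin small letter u with diaeresis}nther"
# The code point name can be written in lowercase
n = unicodedata.name(chr(252))
# Gets the Unicode code point name of a given Unicode character, here ü
v = "Hello G" + chr(252) + "nther"
# chr() takes a Unicode code point and returns a string containing that single character
c = ord("ü")
# Gets the Unicode code point value of a character
b = "Hello Günther".encode("UTF-8")
# Converts a Unicode string into a sequence of bytes (bytes)
b = "Hello Günther".encode("UTF-8"); u = b.decode("UTF-8")
# Decodes bytes into a Unicode string using the decode() method
# v = b"Hello " + "G\u00fcnther"
# Left commented out: it would raise a TypeError, because bytes and str cannot be concatenated
v = b"Hello".decode("ASCII") + "G\u00fcnther"
# Now it works
f = open("File.txt", encoding="UTF-8"); lines = f.readlines(); f.close()
# Opens a file for reading with the given encoding and reads its contents. If no encoding is given, locale.getpreferredencoding() is used.
f = open("File.txt", "w", encoding="UTF-8"); f.write("Hello G\u00fcnther"); f.close()
# Writes to a file using the given encoding
f = open("File.txt", encoding="UTF-8-sig"); lines = f.readlines(); f.close()
# The -sig variant of the encoding automatically strips the byte order mark (BOM), if present
f = tokenize.open("File.txt"); lines = f.readlines(); f.close()
# Detects the encoding automatically from an encoding marker in the file, such as a BOM, and strips the marker
f = open("File.txt", "w", encoding="UTF-8-sig"); f.write("Hello G\u00fcnther"); f.close()
# Writes the file in UTF-8 with a BOM at the start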
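
To tie the Python 3 snippets together, here is a minimal, self-contained sketch of a round trip between str and bytes, including a file written and read back with an explicit encoding. The temporary directory and the variable names are assumptions made for this example, and the with statement is used in place of explicit close() calls.

```python
import os
import tempfile

text = "G\u00fcnther"             # 'Günther', written with an escape
data = text.encode("UTF-8")       # str -> bytes

print(len(text), len(data))       # 7 8: ü is one character but two bytes in UTF-8
print(data)                       # b'G\xc3\xbcnther'

round_trip = data.decode("UTF-8")                  # bytes -> str
ascii_safe = text.encode("ASCII", errors="replace")  # unencodable characters become '?'
print(round_trip == text, ascii_safe)              # True b'G?nther'

# Write and read a file with an explicit encoding; the file is closed automatically.
path = os.path.join(tempfile.mkdtemp(), "File.txt")
with open(path, "w", encoding="UTF-8") as f:
    f.write(text + "\n")
with open(path, encoding="UTF-8") as f:
    lines = f.readlines()
print(lines == [text + "\n"])     # True
```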

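Examples for Python 2.x:

import codecs
import unicodedata

v = u"Hello G\u00fcnther"
# Uses \u00fc to specify a 16-bit Unicode code point
v = u"Hello G\U000000fcnther"
# Uses \U000000fc to specify a 32-bit Unicode code point; note the capital U
v = u"Hello G\N{LATIN SMALL LETTER U WITH DIAERESIS}nther"
# Uses \N followed by the name of the Unicode code point
v = u"Hello G\N{latin small letter u with diaeresis}nther"
# The code point name can be written in lowercase
unicodedata.name(unichr(252))
# Gets the Unicode code point name of a given Unicode character, here ü
v = "Hello G" + unichr(252) + "nther"
# unichr() takes a Unicode code point and returns a unicode string containing that single character
c = ord(u"ü")
# Gets the Unicode code point value of a character
b = u"Hello Günther".encode("UTF-8")
# Converts a Unicode string into a sequence of bytes; type(b) is str
b = u"Hello Günther".encode("UTF-8"); u = b.decode("UTF-8")
# Decodes bytes (of type str) into a Unicode string using the decode() method
v = "Hello" + u"Hello G\u00fcnther"
# A str (bytes) can be concatenated with a Unicode string without an error
f = codecs.open("File.txt", encoding="UTF-8"); lines = f.readlines(); f.close()
# Opens a file for reading with the given encoding and reads its contents.
# If no encoding is given, codecs.open() returns an ordinary file object that reads undecoded bytes (str).
f = codecs.open("File.txt", "w", encoding="UTF-8"); f.write(u"Hello G\u00fcnther"); f.close()
# Writes to a file using the given encoding
# Unlike the Python 3 variant, writing "\n" this way writes a literal \n rather than the
# operating-system-specific newline; this makes a difference on Windows, for example.
# For text-mode-like behavior, write os.linesep explicitly.
f = codecs.open("File.txt", encoding="UTF-8-sig"); lines = f.readlines(); f.close()
# The -sig variant of the encoding automatically strips the byte order mark (BOM), if present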

Last modified: Thursday, 30 January 2025, 23:24