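Python Strings from Beginner to Mastery: A Practical Guide

2026-06-01 09:28:28 Author: 码流怪侠

Strings are one of the most frequently used data types in Python. Whether the task is data processing, web development, log analysis, or audio/video encoding and decoding, string handling is a core skill. This guide goes from string creation through hands-on regular expressions to give you a thorough working knowledge of Python strings.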

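1. Creating Strings and Basic Operations

1.1 Four Ways to Create a String

Python offers four quoting styles for string literals. Single and double quotes are interchangeable, the triple-quoted forms additionally let the literal span multiple lines, and each style has its own typical use: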

# ========== 四种创建方式 ==========
s1 = '单引号字符串'          # 最常用，适合短文本
s2 = "双引号字符串"          # 与单引号等价，适合包含单引号的内容
s3 = '''三重单引号
可以换行
多行文本'''                  # 多行字符串
s4 = """三重双引号
也可以换行
多行文本"""                  # 多行字符串，常作文档字符串(docstring)

print(type(s1))  # <class 'str'>
print(s1)        # 单引号字符串
print(s3)        # 三重单引号\n可以换行\n多行文本（实际输出会换行）

How to choose between single and double quotes:

# 原则：外层引号与内层引号不同，避免转义
msg1 = "It's a beautiful day"     # ✅ 外双内单，无需转义
msg2 = 'He said "Hello"'          # ✅ 外单内双，无需转义
msg3 = 'It\'s a beautiful day'    # ⚠️ 可以但没必要，转义降低可读性

# 团队规范：选择一种风格保持统一，推荐双引号（PEP 257 docstring 用三双引号）

1.2 Raw Strings and Escapes

# ========== 转义字符速查 ==========
print("换行符: \n")       # 换行
print("制表符: \t")       # 制表
print("反斜杠: \\")       # 反斜杠本身
print("单引号: \'")       # 单引号
print("双引号: \"")       # 双引号
print("回车符: \r")       # 回车（覆盖行首）

# ========== 原始字符串 r"" ==========
# Windows 路径 —— 不用 r 就得双重转义
path1 = "C:\\Users\\yance\\Desktop"    # 普通字符串，每个 \ 都要转义
path2 = r"C:\Users\yance\Desktop"      # 原始字符串，所见即所得 ✅

# 正则表达式 —— 不用 r 就变成"转义地狱"
import re
# 匹配数字 \d，不用 r 的话：
re.compile("\\d+")    # Python 先转义为 \d，re 再解释为数字
# 用 r 的话：
re.compile(r"\d+")    # 直达正则引擎，清晰明了 ✅

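# ⚠️ Raw-string pitfall: a raw string cannot end with an odd number of backslashes
# r"hello\"   # SyntaxError! The backslash keeps the closing quote from ending the literal
r"hello\\"   # ✅ Valid, but it holds TWO backslashes: equivalent to "hello\\\\"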

1.3 f-string Formatting (Python 3.6+)

Python has accumulated three string-formatting styles over its history; for most new code, f-strings are the preferred choice:

name = "yance"
age = 28
score = 95.678

# ========== 三种格式化方式对比 ==========

# ❶ % 格式化（C 风格，老旧不推荐）
print("Name: %s, Age: %d" % (name, age))

# ❷ str.format()（Python 2.6+，可读性一般）
print("Name: {}, Age: {}".format(name, age))
print("Name: {0}, Age: {1}".format(name, age))

# ❸ f-string（Python 3.6+，推荐 ✅）
print(f"Name: {name}, Age: {age}")

# ========== f-string 高级用法 ==========

# 表达式求值
print(f"Next year: {age + 1}")              # 直接运算
print(f"Name upper: {name.upper()}")         # 调用方法
print(f"Len: {len(name)}")                   # 调用函数

# 格式控制：{value:format_spec}
print(f"Score: {score:.2f}")                 # 保留2位小数 → 95.68
print(f"Score: {score:>10.2f}")              # 右对齐，宽度10 →     95.68
print(f"Score: {score:<10.2f}")              # 左对齐 → 95.68
print(f"Score: {score:^10.2f}")              # 居中 →   95.68
print(f"Score: {score:0>10.2f}")             # 前补零 → 0000095.68

# 数字格式化
num = 1234567
print(f"千分位: {num:,}")                    # 1,234,567
print(f"百分比: {0.856:.1%}")                # 85.6%
print(f"科学计数: {num:.2e}")                # 1.23e+06
print(f"二进制: {42:b}")                     # 101010
print(f"十六进制: {255:x}")                  # ff
print(f"八进制: {255:o}")                    # 377

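# Python 3.8+ debugging syntax: = (prints the expression text and its value)
x = 42
print(f"{x = }")           # x = 42
print(f"{x * 2 = }")       # x * 2 = 84
print(f"{name = !r}")       # name = 'yance'  (= already defaults to repr(), so !r is redundant here; use !s to drop the quotes)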

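Summary of the formatting styles:

Style	Version	Readability	Performance	Recommendation
`%s`	All versions	★★☆	★★☆	⚠️ Only for maintaining old code
`.format()`	2.6+	★★★	★★★	🔶 When older versions must be supported
`f-string`	3.6+	★★★★	★★★★	✅ First choice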
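
A format specifier can itself contain replacement fields, which helps when the width or precision is only known at run time. The following is a minimal sketch; `format_row` and its parameters are hypothetical names introduced for illustration, not something from the article above.

```python
def format_row(label: str, value: float, width: int = 10, precision: int = 2) -> str:
    """Build one table row; width and precision are filled in at run time."""
    # The inner {width} and {precision} fields are substituted first,
    # then the resulting spec (e.g. ">10.2f") is applied to value.
    return f"{label:<8}|{value:>{width}.{precision}f}|"


print(format_row("score", 95.678))                       # score   |     95.68|
print(format_row("pi", 3.14159, width=8, precision=3))   # pi      |   3.142|
```

Nesting is limited to one level, which is normally enough for width, fill, and precision.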

1.4 String Concatenation and Repetition

# ========== 拼接方式 ==========
# ❶ + 号拼接（少量字符串）
greeting = "Hello" + " " + "World"    # "Hello World"

# ❷ join 拼接（大量字符串，性能最优）✅
words = ["Python", "is", "awesome"]
sentence = " ".join(words)             # "Python is awesome"

# ❸ f-string 拼接（现代写法）✅
lang = "Python"
sentence = f"{lang} is awesome"

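# ⚠️ Performance: + versus join
def concat_plus(n=100000):
    s = ""
    for i in range(n):
        s += "a"          # may copy the accumulated string on each iteration
    return s

def concat_join(n=100000):
    return "".join("a" for _ in range(n))

# CPython can often optimise += in place, so the measured gap depends on the
# interpreter version and workload; join remains the dependable choice for
# combining many pieces. Measure with timeit rather than trusting a fixed ratio.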

# ========== 字符串乘法 ==========
separator = "-" * 40           # "----------------------------------------"
indent = "    " * 3            # 12个空格
pattern = "ab" * 5             # "ababababab"
print(separator)
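
To compare `+=` with `join` on your own machine, the standard-library `timeit` module is the usual tool. The sketch below assumes a small `n` and 20 repetitions, both arbitrary choices; the absolute numbers and even the ratio will differ between interpreter versions, so no expected timings are shown.

```python
import timeit


def concat_plus(n: int) -> str:
    """Build the string with repeated += in a loop."""
    s = ""
    for _ in range(n):
        s += "a"
    return s


def concat_join(n: int) -> str:
    """Build the string with a single join over a list of pieces."""
    return "".join(["a"] * n)


n = 10_000
# Both approaches must produce the same result before timing means anything.
assert concat_plus(n) == concat_join(n)

t_plus = timeit.timeit(lambda: concat_plus(n), number=20)
t_join = timeit.timeit(lambda: concat_join(n), number=20)
print(f"+=   : {t_plus:.4f} s for 20 runs")
print(f"join : {t_join:.4f} s for 20 runs")
```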

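2. Indexing, Slicing, and Steps

2.1 Indexing

Python strings can be indexed from either end: positive indices start at 0 from the left, negative indices start at -1 from the right.

String:     P   y   t   h   o   n
Positive:   0   1   2   3   4   5
Negative:  -6  -5  -4  -3  -2  -1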

s = "Python"

# ========== 正索引 ==========
print(s[0])     # 'P'    首字符
print(s[5])     # 'n'    末字符

# ========== 负索引 ==========
print(s[-1])    # 'n'    末字符（最常用！）
print(s[-6])    # 'P'    首字符
print(s[-2])    # 'o'    倒数第二个

# ========== 索引越界 ==========
# s[6]         # IndexError: string index out of range
# s[-7]        # IndexError: string index out of range

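2.2 Slicing [start:stop:step]

Slicing is one of Python's most elegant features: the range is half-open (start included, stop excluded), and out-of-range bounds are clamped instead of raising an error: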

s = "ABCDEFGHIJ"
#    0123456789

# ========== 基础切片 ==========
print(s[2:5])     # 'CDE'     索引 2,3,4（不含5）
print(s[:5])      # 'ABCDE'   从头到索引4
print(s[5:])      # 'FGHIJ'   从索引5到末尾
print(s[:])       # 'ABCDEFGHIJ'  完整拷贝

# ========== 负索引切片 ==========
print(s[-3:])     # 'HIJ'     最后3个字符
print(s[:-3])     # 'ABCDEFG'  除最后3个之外
print(s[-5:-2])   # 'FGH'     倒数第5到倒数第3（不含-2）

# ========== 越界安全 ==========
print(s[5:100])   # 'FGHIJ'   不报错，取到末尾
print(s[100:200]) # ''        不报错，返回空串

2.3 Step

The step controls both the direction and the stride of a slice:

s = "ABCDEFGHIJ"

# ========== Forward step ==========
print(s[::2])     # 'ACEGI'   every second character (indices 0, 2, 4, ...)
print(s[1::2])    # 'BDFHJ'   every second character (indices 1, 3, 5, ...)
print(s[::3])     # 'ADGJ'    every third character

# ========== 反向步进（负步进） ==========
print(s[::-1])    # 'JIHGFEDCBA'  反转字符串！最常用技巧 ✅
print(s[::-2])    # 'JHFDB'       反向每隔1个取1个
print(s[8:2:-1])  # 'IHGFED'      从索引8反向切到索引3
print(s[7:2:-2])  # 'HFD'         反向每隔1个

# ========== 步进方向与起止方向必须一致 ==========
print(s[2:8:-1])  # ''  步进是反向，但起止是正向 → 空串
print(s[8:2:1])   # ''  步进是正向，但起止是反向 → 空串

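Slicing cheat sheet:

Slices are half-open: start is included, stop is excluded; omit a bound to run to that end.
A positive step walks left to right; a negative step walks right to left.
If the step direction conflicts with the bounds, the result is an empty string; out-of-range bounds never raise.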
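
The clamping rules above can be inspected directly with the built-in `slice` type and its `indices()` method, which reports the concrete `(start, stop, step)` that a slice resolves to for a given length. This is a small sketch; `slice_positions` is a hypothetical helper name used only here.

```python
s = "ABCDEFGHIJ"


def slice_positions(text: str, sl: slice) -> list:
    """Return the concrete indices a slice would visit on text."""
    return list(range(*sl.indices(len(text))))


print(slice(5, 100).indices(len(s)))          # (5, 10, 1)   stop is clamped to len(s)
print(slice(None, None, -1).indices(len(s)))  # (9, -1, -1)  defaults for a reversed slice
print(slice_positions(s, slice(8, 2, -1)))    # [8, 7, 6, 5, 4, 3]
print(slice_positions(s, slice(2, 8, -1)))    # []           direction conflict, nothing visited
```

`s[sl]` and `"".join(s[i] for i in slice_positions(s, sl))` give the same string, which makes the helper convenient for debugging surprising slices.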

2.4 Slice Assignment? Strings Are Immutable!

s = "Python"
# s[0] = "J"    # TypeError: 'str' object does not support item assignment

# 修改字符串的唯一方式：创建新字符串
s = "J" + s[1:]   # 'Jython'  拼接新串
s = s.replace("J", "P")  # 'Python' 用方法返回新串

# 不可变的好处：
# 1. 可以安全地作为字典的 key
# 2. 多个变量可以共享同一字符串对象（内存优化）
# 3. 线程安全
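
When several characters have to change, one common pattern is to convert the string to a list, edit the list, and join it back. The sketch below shows that idea; `replace_at` is a hypothetical helper, and the original string is left untouched because a new object is returned.

```python
def replace_at(text: str, index: int, new_char: str) -> str:
    """Return a copy of text with the character at index replaced."""
    chars = list(text)        # lists are mutable, strings are not
    chars[index] = new_char   # raises IndexError for an out-of-range index
    return "".join(chars)


original = "Python"
changed = replace_at(original, 0, "J")
print(changed)    # Jython
print(original)   # Python (unchanged)

# Immutability is also what makes strings usable as dictionary keys.
counts = {original: 1}
print(counts["Python"])   # 1
```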

3. Common Methods in Detail

3.1 Searching: find / index / count / rfind

s = "Hello, Python! Python is great."

# ========== find / rfind ==========
# find(sub, start, end) → 返回子串首次出现的索引，找不到返回 -1
print(s.find("Python"))       # 7    首次出现
print(s.find("Python", 10))   # 15   从索引10开始找
print(s.find("Java"))         # -1   找不到
print(s.rfind("Python"))      # 15   最后一次出现的索引

# ========== index / rindex ==========
# index 与 find 功能相同，但找不到时抛出 ValueError
print(s.index("Python"))      # 7
# s.index("Java")            # ValueError: substring not found

# ========== count ==========
print(s.count("Python"))      # 2    number of non-overlapping occurrences
print(s.count("o"))           # 3    one in "Hello", one in each "Python"
print(s.count("z"))           # 0

# ========== find vs index 选择建议 ==========
# 确定子串存在时用 index（找不到是异常，暴露 bug）
# 不确定子串是否存在时用 find（找不到返回 -1，正常逻辑）
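
Because `find` accepts a start position, it can be called in a loop to collect every occurrence of a substring. The following is a minimal sketch under the assumption that non-overlapping matches are wanted; `find_all` is a hypothetical helper name.

```python
def find_all(text: str, sub: str) -> list:
    """Return the start index of every non-overlapping occurrence of sub."""
    if not sub:
        raise ValueError("sub must be a non-empty string")
    positions = []
    start = text.find(sub)
    while start != -1:
        positions.append(start)
        # Resume the search just past the match that was found.
        start = text.find(sub, start + len(sub))
    return positions


s = "Hello, Python! Python is great."
print(find_all(s, "Python"))   # [7, 15]
print(find_all(s, "o"))        # [4, 11, 19]
print(find_all(s, "Java"))     # []
```

The length of the returned list always equals `s.count(sub)`, since `count` also counts non-overlapping occurrences.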

3.2 Replacing: replace

s = "Hello, World! World is beautiful."

# ========== 基本替换 ==========
print(s.replace("World", "Python"))
# "Hello, Python! Python is beautiful."
# 默认替换所有出现

# ========== 限制替换次数 ==========
print(s.replace("World", "Python", 1))
# "Hello, Python! World is beautiful."  只替换第1个

# ========== 链式替换 ==========
result = s.replace("World", "Python").replace("beautiful", "awesome")
print(result)
# "Hello, Python! Python is awesome."

# ⚠️ replace 返回新字符串，原字符串不变（字符串不可变！）
original = "abc"
new = original.replace("b", "X")   # "aXc"
print(original)                     # "abc"  原串未变

3.3 Splitting and Joining: split / rsplit / splitlines / join

# ========== split / rsplit ==========
csv_data = "name,age,city"
print(csv_data.split(","))         # ['name', 'age', 'city']

# 限制分割次数
text = "a-b-c-d-e"
print(text.split("-", 2))          # ['a', 'b', 'c-d-e']  从左分割2次
print(text.rsplit("-", 2))         # ['a-b-c', 'd', 'e']   从右分割2次

# 默认按空白字符分割（空格、Tab、换行都行）
messy = "  hello   world  \t python  \n  "
print(messy.split())               # ['hello', 'world', 'python']

# ========== splitlines ==========
multiline = "第一行\n第二行\r\n第三行\r第四行"
print(multiline.splitlines())      # ['第一行', '第二行', '第三行', '第四行']
# Recognises \n, \r\n and \r (plus a few rarer line boundaries such as \v and \f)

# ========== join（split 的逆操作）==========
words = ["Python", "is", "awesome"]
print(" ".join(words))             # "Python is awesome"
print("-".join(words))             # "Python-is-awesome"
print("".join(words))              # "Pythonisawesome"

# 经典用法：路径拼接
import os
dirs = ["home", "yance", "projects"]
path = "/".join(dirs)              # "home/yance/projects"
# 实际项目推荐 os.path.join() 或 pathlib

# ⚠️ join accepts any iterable, but every element must be a str; convert numbers first
nums = [1, 2, 3]
# ",".join(nums)                   # TypeError!
print(",".join(str(n) for n in nums))  # "1,2,3" ✅

3.4 Stripping Whitespace: strip / lstrip / rstrip

# ========== 基本去空白 ==========
s = "  Hello, World!  "
print(s.strip())      # "Hello, World!"   两端去除
print(s.lstrip())     # "Hello, World!  " 去除左端
print(s.rstrip())     # "  Hello, World!" 去除右端

# ========== 去除指定字符 ==========
# strip(chars) —— 去除两端所有在 chars 中的字符（不是去除子串！）
s2 = "***Hello***"
print(s2.strip("*"))        # "Hello"

s3 = "xxxPythonxxx"
print(s3.strip("x"))        # "Python"

s4 = "ABChelloCBA"
print(s4.strip("ABC"))      # "hello"  A/B/C 都会被去除

# ⚠️ 常见误区
s5 = "  Hello World  "
print(s5.strip())           # "Hello World"  只去两端，中间空格保留！

s6 = "abchelloabc"
print(s6.strip("abc"))      # "hello"  不是去子串"abc"，是去 a/b/c 三个字符

# ========== 实际应用 ==========
# 读取用户输入时几乎总要 strip
user_input = input("请输入: ").strip()

# 清洗 CSV 数据
raw_fields = ["  Alice  ", " 28 ", " Beijing "]
clean = [f.strip() for f in raw_fields]  # ['Alice', '28', 'Beijing']
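
The character-set behaviour of `strip(chars)` matters most when the intent is to remove a whole suffix or prefix. The sketch below contrasts `rstrip` with a small suffix-removal helper; `remove_suffix` is a hypothetical name written out so the example runs on older interpreters, while Python 3.9+ offers the built-in `str.removeprefix()` and `str.removesuffix()` for the same purpose.

```python
def remove_suffix(text: str, suffix: str) -> str:
    """Remove suffix once if text ends with it; otherwise return text unchanged."""
    if suffix and text.endswith(suffix):
        return text[: -len(suffix)]
    return text


filename = "test_utils_test.txt"

# rstrip treats ".txt" as the character set {'.', 't', 'x'} and keeps
# stripping until it meets a character outside that set.
print(filename.rstrip(".txt"))          # test_utils_tes

# Removing the suffix as a whole substring gives the intended result.
print(remove_suffix(filename, ".txt"))  # test_utils_test
```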

3.5 Encoding and Decoding: encode / decode

# ========== encode: str → bytes ==========
s = "你好，Python"
print(type(s))               # <class 'str'>

b_utf8 = s.encode("utf-8")   # UTF-8 编码（推荐 ✅）
b_gbk = s.encode("gbk")      # GBK 编码（中文 Windows 常见）
print(b_utf8)                # b'\xe4\xbd\xa0\xe5\xa5\xbd\xef\xbc\x8cPython'
print(b_gbk)                 # b'\xc4\xe3\xba\xc3\xa3\xacPython'

# ========== decode: bytes → str ==========
print(b_utf8.decode("utf-8"))   # "你好，Python"
print(b_gbk.decode("gbk"))      # "你好，Python"

# ⚠️ 编解码必须一致，否则乱码或报错
# b_utf8.decode("gbk")          # UnicodeDecodeError 或乱码！

# ========== Error-handling strategies ==========
bad_bytes = b"\xff\xfe invalid"
print(bad_bytes.decode("utf-8", errors="ignore"))    # " invalid"   invalid bytes are dropped
print(bad_bytes.decode("utf-8", errors="replace"))    # "�� invalid" each invalid byte becomes U+FFFD (�)
print(bad_bytes.decode("utf-8", errors="backslashreplace"))  # prints \xff\xfe invalid (escapes kept as literal text)

# ========== 常用场景 ==========
# 1. 网络传输
data = "Hello".encode("utf-8")   # 发送前编码
msg = data.decode("utf-8")       # 接收后解码

# 2. 文件读写
with open("test.txt", "w", encoding="utf-8") as f:  # 指定编码
    f.write("你好")

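# 3. Guessing an unknown encoding
import chardet                    # third-party package: pip install chardet
raw = "你好".encode("gbk")
result = chardet.detect(raw)      # returns a dict with 'encoding' and 'confidence' keys
# Detection is statistical: on very short inputs like this one the guess is
# unreliable, and 'encoding' may even be None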
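
When the set of plausible encodings is small and known in advance, a dependency-free alternative is to try each one in order. This is only a sketch: `decode_with_fallback` and its default candidate list are assumptions made for illustration, and the order matters because UTF-8 is strict while GBK accepts many byte sequences.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "gbk")) -> tuple:
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue   # this candidate failed, try the next one
    # Last resort: keep going, replacing undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8 (with replacement)"


print(decode_with_fallback("你好".encode("utf-8")))   # ('你好', 'utf-8')
print(decode_with_fallback("你好".encode("gbk")))     # ('你好', 'gbk')
```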

3.6 Quick Reference for Other Common Methods

s = "Hello, Python!"

# ========== 大小写 ==========
print(s.upper())          # "HELLO, PYTHON!"     全大写
print(s.lower())          # "hello, python!"      全小写
print(s.title())          # "Hello, Python!"      每个单词首字母大写
print(s.capitalize())     # "Hello, python!"      仅首字母大写
print(s.swapcase())       # "hELLO, pYTHON!"      大小写互换

# ========== 判断类 ==========
print("123".isdigit())        # True   是否全是数字
print("abc".isalpha())        # True   是否全是字母
print("abc123".isalnum())     # True   是否全是字母或数字
print("   ".isspace())        # True   是否全是空白
print("Hello".isupper())      # False  是否全大写
print("hello".islower())      # True   是否全小写
print("Hello World".istitle()) # True  是否标题格式

# ========== 填充对齐 ==========
print("Python".center(20, "-"))    # "-------Python-------"
print("Python".ljust(20, "-"))     # "Python--------------"
print("Python".rjust(20, "-"))     # "--------------Python"
print("42".zfill(6))               # "000042"  左补零（处理编号常用）

# ========== 前缀后缀 ==========
print("test.py".startswith("test"))   # True
print("test.py".endswith(".py"))      # True
# 支持元组参数
print("test.py".endswith((".py", ".txt")))   # True

# ========== 判断包含 ==========
print("Python" in "I love Python")   # True   推荐方式 ✅
print("Java" not in "I love Python") # True

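4. Getting Started with Regular Expressions: Core re Module APIs

4.1 What Is a Regular Expression?

A regular expression (regex) is a pattern-matching language used to search, match, and replace text that follows a given rule.

Plain substring search:      "error" in log                → finds fixed text only
Regular-expression search:   re.search(r"error\d+", log)   → finds patterns such as error42 or error007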

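4.2 Metacharacter Quick Reference

Metacharacter	Meaning	Example	Matches
`.`	Any single character (except newline)	`a.c`	abc, a1c, a c
`^`	Start of string (start of each line with re.MULTILINE)	`^Hello`	Hello at the start
`$`	End of string (end of each line with re.MULTILINE)	`world$`	world at the end
`*`	Previous item, 0 or more times	`ab*c`	ac, abc, abbc
`+`	Previous item, 1 or more times	`ab+c`	abc, abbc
`?`	Previous item, 0 or 1 time	`ab?c`	ac, abc
`{n}`	Previous item, exactly n times	`\d{3}`	123
`{n,m}`	Previous item, n to m times	`\d{2,4}`	12, 123, 1234
`[]`	Character set	`[aeiou]`	a, e, i, o, u
`[^]`	Negated character set	`[^0-9]`	any non-digit
`()`	Group (capturing)	`(ab)+`	ab, abab
`|`	Alternation (or)	`cat|dog`	cat or dog
`\d`	Digit (`[0-9]` with re.ASCII; any Unicode decimal digit by default)	`\d+`	42, 007
`\w`	Word character (`[a-zA-Z0-9_]` with re.ASCII; Unicode letters, digits, and underscore by default)	`\w+`	hello_42
`\s`	Whitespace character	`\s+`	space/Tab/newline
`\D`	Non-digit	`\D+`	abc
`\W`	Non-word character	`\W+`	!@#
`\S`	Non-whitespace	`\S+`	a run of non-whitespace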
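
For `str` patterns in Python 3, `\d` and `\w` are Unicode-aware unless the `re.ASCII` flag is given. The short sketch below shows the difference; the sample strings are made up for illustration.

```python
import re

sample = "42 and ４２"   # the second number uses full-width digits

print(re.findall(r"\d+", sample))             # ['42', '４２']
print(re.findall(r"\d+", sample, re.ASCII))   # ['42']

name = "中文_name"
print(re.findall(r"\w+", name))               # ['中文_name']
print(re.findall(r"\w+", name, re.ASCII))     # ['_name']
```

If the input should be restricted to ASCII digits, for example when validating an ID field, either pass `re.ASCII` or spell the class out as `[0-9]`.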

4.3 Six Core re Module APIs

import re

text = "My phone is 138-1234-5678 and office is 010-8765-4321"

# ========== ❶ re.search(): find the first match ==========
# Scans the whole string and returns a Match object for the first match, or None
match = re.search(r"\d{3}-\d{4}-\d{4}", text)
if match:
    print(f"Found: {match.group()}")     # Found: 138-1234-5678
    print(f"Position: {match.start()}-{match.end()}")  # Position: 12-25
    print(f"span: {match.span()}")      # span: (12, 25)

# ========== ❷ re.match(): match only at the start of the string ==========
# Matches only at position 0; returns None if the start does not match
result1 = re.match(r"My", text)          # ✅ matches
result2 = re.match(r"phone", text)       # None: "phone" is not at the start
# search with ^ is an alternative; the two agree as long as re.MULTILINE is not set
result3 = re.search(r"^My", text)        # same result as match here

# ========== ❸ re.findall() —— 找出所有匹配，返回列表 ==========
phones = re.findall(r"\d{3}-\d{4}-\d{4}", text)
print(phones)   # ['138-1234-5678', '010-8765-4321']

# 带分组的 findall —— 返回分组内容的元组列表
pattern = r"(\d{3})-(\d{4})-(\d{4})"
groups = re.findall(pattern, text)
print(groups)   # [('138', '1234', '5678'), ('010', '8765', '4321')]

# ========== ❹ re.finditer() —— 迭代器版 findall ==========
# 返回迭代器，每个元素是 Match 对象（比 findall 更节省内存）
for m in re.finditer(r"\d{3}-\d{4}-\d{4}", text):
    print(f"Phone: {m.group()}, Position: {m.span()}")

# ========== ❺ re.sub() —— 替换 ==========
# re.sub(pattern, replacement, string, count=0)
# 隐藏手机号中间四位
hidden = re.sub(r"(\d{3})-\d{4}-(\d{4})", r"\1-****-\2", text)
print(hidden)   # My phone is 138-****-5678 and office is 010-****-4321

# 使用函数作为 replacement
def mask_phone(match):
    return f"{match.group(1)}-****-{match.group(3)}"

hidden2 = re.sub(r"(\d{3})-(\d{4})-(\d{4})", mask_phone, text)

# ========== ❻ re.split() —— 正则分割 ==========
# 按多种分隔符分割
data = "one,two;three|four"
result = re.split(r"[,;|]", data)
print(result)   # ['one', 'two', 'three', 'four']

# 带捕获组的 split —— 分隔符也会出现在结果中
result2 = re.split(r"([,;|])", data)
print(result2)  # ['one', ',', 'two', ';', 'three', '|', 'four']

4.4 Common Match Object Methods

import re

text = "Order #12345, total: $99.50"
match = re.search(r"Order #(\d+), total: \$(\d+\.\d+)", text)

if match:
    match.group()        # 'Order #12345, total: $99.50'  the whole match
    match.group(0)       # same as above; group 0 is the whole match
    match.group(1)       # '12345'  first group
    match.group(2)       # '99.50'  second group
    match.groups()       # ('12345', '99.50')  tuple of all groups
    match.start()        # 0   start index of the match
    match.end()          # 27  end index of the match (exclusive)
    match.span()         # (0, 27)  (start, end) pair

    # 命名分组（更可读）
    match2 = re.search(r"Order #(?P<id>\d+), total: \$(?P<amount>\d+\.\d+)", text)
    match2.group("id")       # '12345'
    match2.group("amount")   # '99.50'
    match2.groupdict()       # {'id': '12345', 'amount': '99.50'}
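
Named groups combine naturally with `finditer` when a text contains several records. The sketch below extends the order example to two orders and converts the captured strings to numbers; the second order and the names `ORDER_RE` and `orders` are invented for illustration.

```python
import re

ORDER_RE = re.compile(r"Order #(?P<id>\d+), total: \$(?P<amount>\d+\.\d+)")

text = "Order #12345, total: $99.50; Order #12346, total: $5.00"

# m["id"] is shorthand for m.group("id") (available since Python 3.6).
orders = [
    {"id": int(m["id"]), "amount": float(m["amount"])}
    for m in ORDER_RE.finditer(text)
]
print(orders)   # [{'id': 12345, 'amount': 99.5}, {'id': 12346, 'amount': 5.0}]
```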

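4.5 Compiling Patterns: re.compile

When the same pattern is used repeatedly, precompiling it gives the pattern a reusable name and skips the lookup in the re module's internal pattern cache; because of that cache, the speed-up is usually modest: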

import re

# ========== 预编译 ==========
phone_pattern = re.compile(r"(\d{3})-(\d{4})-(\d{4})")

# 之后直接用编译好的对象调用方法
text1 = "Call 138-1234-5678"
text2 = "Fax 010-8765-4321"

match1 = phone_pattern.search(text1)
match2 = phone_pattern.search(text2)

# 编译时可以加标志位
case_insensitive = re.compile(r"hello", re.IGNORECASE)  # 忽略大小写
multiline_mode = re.compile(r"^hello", re.MULTILINE)     # ^ 匹配每行行首
dotall_mode = re.compile(r"hello.world", re.DOTALL)      # . 匹配换行符

# 常用标志位
# re.IGNORECASE / re.I   忽略大小写
# re.MULTILINE / re.M    ^ 和 $ 匹配每行
# re.DOTALL / re.S       . 匹配换行符
# re.VERBOSE / re.X      允许写注释（复杂正则推荐）

VERBOSE mode makes complex patterns readable:

import re

# 普通写法：难以阅读
pattern1 = r"(?P<hour>\d{2}):(?P<min>\d{2}):(?P<sec>\d{2})\.(?P<ms>\d{3})"

# VERBOSE 写法：清晰明了 ✅
pattern2 = re.compile(r"""
    (?P<hour>\d{2})      # 时：2位数字
    :                    # 分隔符
    (?P<min>\d{2})       # 分：2位数字
    :                    # 分隔符
    (?P<sec>\d{2})       # 秒：2位数字
    \.                   # 小数点
    (?P<ms>\d{3})        # 毫秒：3位数字
""", re.VERBOSE)

time_text = "14:30:25.123"
match = pattern2.search(time_text)
print(match.groupdict())  # {'hour': '14', 'min': '30', 'sec': '25', 'ms': '123'}

4.6 Greedy vs. Non-Greedy

import re

html = "<div>Hello</div><div>World</div>"

# ========== 贪婪匹配（默认） ==========
# .* 和 .+ 会尽可能多地匹配
greedy = re.findall(r"<div>.*</div>", html)
print(greedy)   # ['<div>Hello</div><div>World</div>']  匹配了整个！⚠️

# ========== 非贪婪匹配（加 ?） ==========
# .*? 和 .+? 会尽可能少地匹配
non_greedy = re.findall(r"<div>.*?</div>", html)
print(non_greedy)  # ['<div>Hello</div>', '<div>World</div>']  ✅ 两个独立匹配

# 规则总结：
# 贪婪：  *   +   ?   {n,m}    → 尽量多匹配
# 非贪婪：*?  +?  ??  {n,m}?   → 尽量少匹配
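
An alternative to a lazy quantifier is a negated character class, which states explicitly what the content may not contain. The sketch below compares the two on the same input and on a nested snippet; the nested string is made up for illustration, and neither pattern is a substitute for a real HTML parser.

```python
import re

html = "<div>Hello</div><div>World</div>"

lazy = re.findall(r"<div>(.*?)</div>", html)
negated = re.findall(r"<div>([^<]*)</div>", html)
print(lazy)      # ['Hello', 'World']
print(negated)   # ['Hello', 'World']

# The two differ as soon as the content itself contains a tag.
nested = "<div>a<b>x</b></div>"
print(re.findall(r"<div>(.*?)</div>", nested))    # ['a<b>x</b>']
print(re.findall(r"<div>([^<]*)</div>", nested))  # []
```

The negated class cannot run past a `<`, so it fails fast instead of backtracking; the lazy form is more permissive.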

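4.7 Quick Reference for Common Patterns

import re

# These patterns are deliberately loose: they check the overall shape, not full validity.

# Email
email_re = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

# Mobile number (mainland China)
phone_re = r"1[3-9]\d{9}"

# IPv4 address (shape only: also accepts out-of-range values such as 999.999.999.999)
ip_re = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

# URL (simplified: the match ends at the last ".xx" suffix, so a trailing path without a dot is cut off)
url_re = r"https?://[^\s<>\"']+\.[a-zA-Z]{2,}"

# Date YYYY-MM-DD (digits only: does not reject month 13 or day 32)
date_re = r"\d{4}-\d{2}-\d{2}"

# ID card number (18 characters; no checksum validation)
id_re = r"\d{17}[\dXx]"

# Chinese characters (the basic CJK range U+4E00 to U+9FA5)
chinese_re = r"[\u4e00-\u9fa5]+"

# Quick check
test_email = "user@example.com"
print(bool(re.fullmatch(email_re, test_email)))  # True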
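
For values with numeric constraints, such as IPv4 addresses, a regex can find candidates while a dedicated parser validates them. The sketch below pairs the shape-only pattern with the standard-library `ipaddress` module; `is_valid_ipv4` is a hypothetical helper name.

```python
import ipaddress
import re

ip_re = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"


def is_valid_ipv4(candidate: str) -> bool:
    """Return True only if candidate parses as a real IPv4 address."""
    try:
        ipaddress.IPv4Address(candidate)
    except ValueError:   # AddressValueError is a subclass of ValueError
        return False
    return True


print(bool(re.fullmatch(ip_re, "999.999.999.999")))   # True  (shape matches)
print(is_valid_ipv4("999.999.999.999"))               # False (octets out of range)
print(is_valid_ipv4("192.168.1.100"))                 # True
```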

5. bytes vs. str in Detail

5.1 The Fundamental Difference

┌────────────────────────────────────────────────────────┐
│                      str (string)                      │
│  - Stores Unicode code points (abstract characters)    │
│  - Human-readable text                                 │
│  - The default string type in Python 3                 │
│  - Example: "你好"                                     │
├────────────────────────────────────────────────────────┤
│                  bytes (byte string)                   │
│  - Stores raw bytes (a sequence of integers 0-255)     │
│  - Binary data as the machine sees it                  │
│  - Used for network transfer and file storage          │
│  - Example: b'\xe4\xbd\xa0\xe5\xa5\xbd'                │
└────────────────────────────────────────────────────────┘

Conversion:
       encode()          decode()
  str ──────────→ bytes ──────────→ str

5.2 Creation Compared

# ========== str 创建 ==========
s1 = "Hello"              # 普通字符串
s2 = '你好'               # Unicode 字符串
s3 = """多行
字符串"""                  # 多行字符串

# ========== bytes 创建 ==========
b1 = b"Hello"             # 字节串字面量（只支持 ASCII 字符）
b2 = b'\x48\x65\x6c\x6c\x6f'  # 十六进制表示
b3 = bytes([72, 101, 108, 108, 111])  # 从整数列表创建
b4 = "你好".encode("utf-8")    # 从字符串编码

# ⚠️ bytes 字面量只能包含 ASCII 字符
# b"你好"               # SyntaxError!
# 必须通过 encode 创建含中文的 bytes

# ========== bytearray（可变字节串） ==========
ba = bytearray(b"Hello")
ba[0] = 74               # 可修改！bytearray 是可变的
print(ba)                 # bytearray(b'Jello')

5.3 Operations Compared

s = "Hello"
b = b"Hello"

# ========== 相同的操作 ==========
print(len(s), len(b))          # 5 5
print(s[0], b[0])              # H 72  ⚠️ str 索引返回字符，bytes 索引返回整数！
print(s[1:3], b[1:3])          # el b'el'  切片都返回同类型
print(s + " World", b + b" World")  # 拼接
print(s * 2, b * 2)            # 重复

# ========== 不同的操作 ==========
# 索引类型不同
print(type(s[0]))    # <class 'str'>
print(type(b[0]))    # <class 'int'>   ⚠️ 关键区别！

# 遍历
for ch in "ABC":
    print(ch)         # A  B  C     → 字符

for byte in b"ABC":
    print(byte)       # 65 66 67    → 整数

# 包含判断
print("H" in "Hello")     # True   str 中判断字符
print(72 in b"Hello")     # True   bytes 中判断整数
print(b"H" in b"Hello")   # True   bytes 中也可以判断字节串

# ⚠️ 不能混合操作
# "Hello" + b" World"     # TypeError!
# "Hello" == b"Hello"     # False  类型不同，永不相等
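
Since indexing a `bytes` object yields an `int`, getting a one-byte `bytes` value back takes a slightly different expression. The sketch below lists the common options using only built-in operations.

```python
b = b"Hello"

print(b[0])           # 72      indexing gives an int
print(b[0:1])         # b'H'    a one-element slice stays bytes
print(bytes([b[0]]))  # b'H'    rebuild bytes from a list of ints
print(chr(b[0]))      # H       int code to a str character (ASCII range here)

# Hex is a convenient text form for logging binary data.
print(b.hex())                       # 48656c6c6f
print(bytes.fromhex("48656c6c6f"))   # b'Hello'
```

Note that `bytes(5)` creates five zero bytes rather than the byte with value 5, which is why the list form `bytes([5])` is used for single values.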

5.4 Encodings in Depth: UTF-8 vs GBK vs ASCII

text = "你好A"

# ========== Comparing encodings ==========
utf8_bytes = text.encode("utf-8")    # b'\xe4\xbd\xa0\xe5\xa5\xbdA'  7 bytes (3 + 3 + 1)
gbk_bytes = text.encode("gbk")       # b'\xc4\xe3\xba\xc3A'          5 bytes (2 + 2 + 1)

print(f"UTF-8: {utf8_bytes}  长度: {len(utf8_bytes)}")
print(f"GBK:   {gbk_bytes}  长度: {len(gbk_bytes)}")

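# Encoding rules:
# UTF-8: 1 byte for ASCII, 3 bytes for most Chinese characters (variable-length, 1 to 4 bytes; the web standard) ✅
# GBK:   1 byte for ASCII, 2 bytes for Chinese characters (legacy encoding on Chinese Windows)
# ASCII: English letters, digits, and symbols only, 1 byte each (code points 0-127)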

# ========== 查看字符的 Unicode 码点 ==========
print(ord("你"))       # 20320   Unicode 码点
print(ord("A"))        # 65
print(chr(20320))      # "你"    码点转字符
print(chr(65))         # "A"

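# ========== Practical scenarios ==========
# Scenario 1: guessing a file's encoding
import chardet   # third-party package: pip install chardet

def detect_file_encoding(filepath):
    """Guess the file's encoding from its first 4 KB, then read it as text."""
    with open(filepath, "rb") as f:       # read in binary mode
        raw = f.read(4096)                 # first 4 KB only
    result = chardet.detect(raw)
    encoding = result["encoding"] or "utf-8"   # detect() may return None; fall back to UTF-8
    with open(filepath, "r", encoding=encoding) as f:
        return f.read()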

# 场景2：网络数据
import json
data = {"name": "yance", "msg": "你好"}
json_bytes = json.dumps(data, ensure_ascii=False).encode("utf-8")
# 发送 json_bytes 到网络...
received = json_bytes.decode("utf-8")
parsed = json.loads(received)

# 场景3：大小计算
msg = "Hello你好"
print(f"字符数: {len(msg)}")                    # 7  (5个英文字母 + 2个中文字)
print(f"UTF-8 字节数: {len(msg.encode('utf-8'))}")  # 11 (5×1 + 2×3)
print(f"GBK 字节数: {len(msg.encode('gbk'))}")      # 9  (5×1 + 2×2)
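
The gap between character count and byte count becomes a practical issue when a field has a byte limit, since cutting the encoded bytes at an arbitrary position can split a multi-byte character. The sketch below shows one simple way to handle it; `truncate_utf8` is a hypothetical helper, and dropping the incomplete tail with `errors="ignore"` is an assumption about the desired behaviour.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text so that its UTF-8 encoding fits in max_bytes."""
    encoded = text.encode("utf-8")[:max_bytes]
    # If the cut landed inside a multi-byte character, the incomplete
    # trailing bytes are dropped instead of raising UnicodeDecodeError.
    return encoded.decode("utf-8", errors="ignore")


msg = "Hello你好"
print(truncate_utf8(msg, 7))    # Hello      (2 of the 3 bytes of 你 do not fit)
print(truncate_utf8(msg, 8))    # Hello你
print(truncate_utf8(msg, 11))   # Hello你好
```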

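5.5 str / bytes / bytearray Quick Comparison

Feature	str	bytes	bytearray
Mutability	❌ Immutable	❌ Immutable	✅ Mutable
Literal	`"hello"`	`b"hello"`	None
Indexing returns	`str` (character)	`int` (0-255)	`int` (0-255)
Chinese text	✅	❌ (must be encoded)	❌ (must be encoded)
Conversion method	`.encode()`	`.decode()`	`.decode()`
Network transfer	❌ Must be encoded first	✅	✅
Dict key	✅	✅	❌ (unhashable)
Typical use	Text processing	Network I/O, file I/O	Binary data that must be modified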

6. Hands-On Demo: A Log Parser (Regex in Practice)

6.1 Requirements

Given access logs in Nginx style, parse each entry into structured data:

192.168.1.100 - - [30/May/2026:10:15:30 +0800] "GET /api/users HTTP/1.1" 200 1234 "https://example.com" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
10.0.0.1 - admin [30/May/2026:10:16:45 +0800] "POST /api/login HTTP/1.1" 401 56 "-" "curl/7.68.0"
172.16.0.50 - - [30/May/2026:10:17:12 +0800] "GET /static/logo.png HTTP/1.1" 304 0 "https://example.com/home" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"

Fields to extract: IP, user, time, request method, request path, protocol, status code, response size, referer, and user agent.

6.2 Full Implementation

"""
日志解析器 —— Python 字符串与正则实战
功能：解析 Nginx 访问日志，生成统计报告
"""

import re
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


# ==================== 数据模型 ====================

@dataclass
class LogEntry:
    """单条日志记录"""
    ip: str
    user: str
    datetime: str
    method: str
    path: str
    protocol: str
    status: int
    size: int
    referer: str
    user_agent: str


@dataclass
class LogReport:
    """日志统计报告"""
    total_requests: int = 0
    status_counter: Counter = field(default_factory=Counter)
    method_counter: Counter = field(default_factory=Counter)
    ip_counter: Counter = field(default_factory=Counter)
    path_counter: Counter = field(default_factory=Counter)
    total_bytes: int = 0
    entries: list = field(default_factory=list)


# ==================== 正则模式 ====================

# 使用 VERBOSE 模式，让正则可读
NGINX_LOG_PATTERN = re.compile(r"""
    ^(?P<ip>\S+)                    # IP 地址
    \s+-\s+                         # 分隔符: -
    (?P<user>\S+)                   # 用户（- 表示匿名）
    \s+\[                           # 分隔符: [
    (?P<datetime>[^\]]+)            # 时间：30/May/2026:10:15:30 +0800
    \]\s+"                          # 分隔符: ] "
    (?P<method>GET|POST|PUT|DELETE|PATCH|HEAD|OPTIONS)  # 请求方法
    \s+                             # 空格
    (?P<path>\S+)                   # 请求路径
    \s+                             # 空格
    (?P<protocol>HTTP/[\d.]+)       # 协议版本
    "\s+                            # 分隔符: "
    (?P<status>\d{3})               # 状态码
    \s+                             # 空格
    (?P<size>\d+)                   # 响应大小
    \s+"                            # 分隔符: "
    (?P<referer>[^"]*)              # 来源页
    "\s+"                           # 分隔符: " "
    (?P<user_agent>[^"]*)           # User-Agent
    ".*$                            # 结尾
""", re.VERBOSE)


# ==================== 解析器 ====================

class LogParser:
    """Nginx 日志解析器"""

    def __init__(self, pattern: re.Pattern = NGINX_LOG_PATTERN):
        self.pattern = pattern

    def parse_line(self, line: str) -> Optional[LogEntry]:
        """解析单行日志"""
        line = line.strip()
        if not line:
            return None

        match = self.pattern.match(line)
        if not match:
            print(f"[WARN] 无法解析: {line[:80]}...")
            return None

        d = match.groupdict()
        return LogEntry(
            ip=d["ip"],
            user=d["user"] if d["user"] != "-" else "anonymous",
            datetime=d["datetime"],
            method=d["method"],
            path=d["path"],
            protocol=d["protocol"],
            status=int(d["status"]),
            size=int(d["size"]),
            referer=d["referer"] if d["referer"] != "-" else "",
            user_agent=d["user_agent"],
        )

    def parse_file(self, filepath: str) -> LogReport:
        """解析日志文件"""
        report = LogReport()

        with open(filepath, "r", encoding="utf-8") as f:
            for line in f:
                entry = self.parse_line(line)
                if entry:
                    report.entries.append(entry)
                    report.total_requests += 1
                    report.status_counter[entry.status] += 1
                    report.method_counter[entry.method] += 1
                    report.ip_counter[entry.ip] += 1
                    report.path_counter[entry.path] += 1
                    report.total_bytes += entry.size

        return report

    def parse_string(self, log_text: str) -> LogReport:
        """解析日志字符串（用于测试）"""
        report = LogReport()

        for line in log_text.strip().split("\n"):
            entry = self.parse_line(line)
            if entry:
                report.entries.append(entry)
                report.total_requests += 1
                report.status_counter[entry.status] += 1
                report.method_counter[entry.method] += 1
                report.ip_counter[entry.ip] += 1
                report.path_counter[entry.path] += 1
                report.total_bytes += entry.size

        return report


# ==================== 报告生成 ====================

class ReportPrinter:
    """格式化输出统计报告"""

    @staticmethod
    def print_report(report: LogReport):
        """输出完整报告"""
        print("=" * 60)
        print("📊 Nginx 日志分析报告")
        print("=" * 60)

        # 总览
        print(f"\n📋 总览")
        print(f"  总请求数:   {report.total_requests}")
        print(f"  总流量:     {report.total_bytes:,} bytes "
              f"({report.total_bytes / 1024:.1f} KB)")

        # 状态码分布
        print(f"\n📈 状态码分布")
        for status, count in report.status_counter.most_common():
            bar = "█" * (count * 40 // report.total_requests)
            print(f"  {status}: {count:>4}  {bar}")

        # 请求方法分布
        print(f"\n🔧 请求方法分布")
        for method, count in report.method_counter.most_common():
            print(f"  {method:<8}: {count}")

        # TOP 5 IP
        print(f"\n🌐 TOP 5 访问 IP")
        for ip, count in report.ip_counter.most_common(5):
            print(f"  {ip:<18}: {count} 次")

        # TOP 5 路径
        print(f"\n📍 TOP 5 访问路径")
        for path, count in report.path_counter.most_common(5):
            print(f"  {path:<30}: {count} 次")

        print("\n" + "=" * 60)

    @staticmethod
    def search_entries(report: LogReport, **kwargs) -> list[LogEntry]:
        """按条件筛选日志条目"""
        results = report.entries
        for key, value in kwargs.items():
            results = [e for e in results if getattr(e, key) == value]
        return results


# ==================== 主程序 ====================

def main():
    """主函数：演示完整流程"""

    # 模拟日志数据
    sample_log = """
192.168.1.100 - - [30/May/2026:10:15:30 +0800] "GET /api/users HTTP/1.1" 200 1234 "https://example.com" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
10.0.0.1 - admin [30/May/2026:10:16:45 +0800] "POST /api/login HTTP/1.1" 401 56 "-" "curl/7.68.0"
172.16.0.50 - - [30/May/2026:10:17:12 +0800] "GET /static/logo.png HTTP/1.1" 304 0 "https://example.com/home" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
192.168.1.100 - - [30/May/2026:10:18:00 +0800] "GET /api/users HTTP/1.1" 200 2456 "https://example.com/users" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
10.0.0.1 - admin [30/May/2026:10:18:30 +0800] "POST /api/login HTTP/1.1" 200 234 "-" "curl/7.68.0"
192.168.1.200 - - [30/May/2026:10:19:05 +0800] "DELETE /api/users/5 HTTP/1.1" 403 89 "https://example.com/admin" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
172.16.0.50 - - [30/May/2026:10:19:30 +0800] "GET /static/style.css HTTP/1.1" 200 8765 "https://example.com/home" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
192.168.1.100 - - [30/May/2026:10:20:00 +0800] "PUT /api/users/1 HTTP/1.1" 200 567 "https://example.com/users/1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
10.0.0.2 - - [30/May/2026:10:20:15 +0800] "GET /api/products HTTP/1.1" 200 4532 "https://example.com/shop" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_0 like Mac OS X)"
192.168.1.100 - - [30/May/2026:10:21:00 +0800] "GET /api/users HTTP/1.1" 500 120 "https://example.com/users" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
"""

    # 1. 解析日志
    parser = LogParser()
    report = parser.parse_string(sample_log)

    # 2. 输出报告
    ReportPrinter.print_report(report)

    # 3. 条件筛选示例
    print("\n🔍 筛选示例：状态码为 200 的请求")
    success_entries = ReportPrinter.search_entries(report, status=200)
    for entry in success_entries:
        print(f"  {entry.ip} → {entry.method} {entry.path} ({entry.status})")

    print("\n🔍 筛选示例：来自 192.168.1.100 的请求")
    ip_entries = ReportPrinter.search_entries(report, ip="192.168.1.100")
    for entry in ip_entries:
        print(f"  {entry.method} {entry.path} → {entry.status}")

    # 4. 字符串操作技巧演示
    print("\n" + "=" * 60)
    print("💡 日志解析中用到的字符串技巧")
    print("=" * 60)

    line = '192.168.1.100 - - [30/May/2026:10:15:30 +0800] "GET /api/users HTTP/1.1" 200 1234'

    # 技巧1：split 快速提取
    ip = line.split()[0]
    print(f"split 取 IP: {ip}")

    # 技巧2：strip 去除引号
    raw_request = '"GET /api/users HTTP/1.1"'
    request = raw_request.strip('"')
    print(f"strip 去引号: {request}")

    # 技巧3：正则命名分组提取
    time_match = re.search(r"\[(?P<time>[^\]]+)\]", line)
    if time_match:
        print(f"正则提时间: {time_match.group('time')}")

    # 技巧4：f-string 格式化输出
    status = 200
    size = 1234
    print(f"格式化输出: status={status:03d}, size={size:,} bytes")

    # 技巧5：join 拼接路径
    segments = ["api", "v2", "users", "123"]
    url_path = "/" + "/".join(segments)
    print(f"join 拼路径: {url_path}")


if __name__ == "__main__":
    main()

6.3 Sample Output

============================================================
📊 Nginx 日志分析报告
============================================================

📋 总览
  总请求数:   10
  总流量:     18,053 bytes (17.6 KB)

📈 状态码分布
  200:    6  ████████████████████████
  401:    1  ████
  304:    1  ████
  403:    1  ████
  500:    1  ████

🔧 请求方法分布
  GET     : 6
  POST    : 2
  DELETE  : 1
  PUT     : 1

🌐 TOP 5 访问 IP
  192.168.1.100     : 4 次
  10.0.0.1          : 2 次
  172.16.0.50       : 2 次
  192.168.1.200     : 1 次
  10.0.0.2          : 1 次

📍 TOP 5 访问路径
  /api/users                    : 3 次
  /api/login                    : 2 次
  /static/logo.png              : 1 次
  /api/users/5                  : 1 次
  /static/style.css             : 1 次

============================================================

🔍 筛选示例：状态码为 200 的请求
  192.168.1.100 → GET /api/users (200)
  192.168.1.100 → GET /api/users (200)
  10.0.0.1 → POST /api/login (200)
  172.16.0.50 → GET /static/style.css (200)
  192.168.1.100 → PUT /api/users/1 (200)
  10.0.0.2 → GET /api/products (200)

🔍 筛选示例：来自 192.168.1.100 的请求
  GET /api/users → 200
  GET /api/users → 200
  PUT /api/users/1 → 200
  GET /api/users → 500

============================================================
💡 日志解析中用到的字符串技巧
============================================================
split 取 IP: 192.168.1.100
strip 去引号: GET /api/users HTTP/1.1
正则提时间: 30/May/2026:10:15:30 +0800
格式化输出: status=200, size=1,234 bytes
join 拼路径: /api/v2/users/123
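
The parser keeps the timestamp as a plain string. If time-based filtering or sorting is needed, it can be converted with `datetime.strptime`. The sketch below is one possible approach; `parse_nginx_time` is a hypothetical helper, and it assumes the default C/English locale because `%b` matches locale-dependent month abbreviations.

```python
from datetime import datetime, timedelta


def parse_nginx_time(raw: str) -> datetime:
    """Parse a timestamp such as '30/May/2026:10:15:30 +0800' into an aware datetime."""
    # %b = abbreviated month name, %z = UTC offset like +0800
    return datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")


ts = parse_nginx_time("30/May/2026:10:15:30 +0800")
print(ts.isoformat())    # 2026-05-30T10:15:30+08:00
print(ts.utcoffset())    # 8:00:00

# Aware datetimes can be compared and subtracted directly.
later = parse_nginx_time("30/May/2026:10:21:00 +0800")
print(later - ts)        # 0:05:30
```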

7. Summary and Cheat Sheets

7.1 String Method Quick Reference

Category	Method	Description	Example
Search	`find()`	Returns index or -1	`"hello".find("ll")` → 2
	`index()`	Returns index or raises ValueError	`"hello".index("ll")` → 2
	`count()`	Number of occurrences	`"hello".count("l")` → 2
Replace	`replace()`	Replace a substring	`"aabb".replace("a","x")` → "xxbb"
Split	`split()`	Split into a list	`"a,b,c".split(",")` → ['a','b','c']
	`rsplit()`	Split from the right	`"a-b-c".rsplit("-",1)` → ['a-b','c']
	`splitlines()`	Split by lines	`"a\nb".splitlines()` → ['a','b']
	`join()`	Join into a string	`",".join(['a','b'])` → "a,b"
Strip	`strip()`	Strip both ends	`" hi ".strip()` → "hi"
	`lstrip()`	Strip the left end	`" hi ".lstrip()` → "hi "
	`rstrip()`	Strip the right end	`" hi ".rstrip()` → " hi"
Case	`upper()`	All uppercase	`"Hi".upper()` → "HI"
	`lower()`	All lowercase	`"Hi".lower()` → "hi"
	`title()`	Title case	`"hi world".title()` → "Hi World"
	`capitalize()`	Capitalize the first letter	`"hi".capitalize()` → "Hi"
	`swapcase()`	Swap upper and lower case	`"Hi".swapcase()` → "hI"
Test	`startswith()`	Prefix test	`"test.py".startswith("test")` → True
	`endswith()`	Suffix test	`"test.py".endswith(".py")` → True
	`isdigit()`	All digits	`"123".isdigit()` → True
	`isalpha()`	All letters	`"abc".isalpha()` → True
	`isalnum()`	Letters or digits	`"abc123".isalnum()` → True
Align	`center()`	Center	`"hi".center(6)` → "  hi  "
	`ljust()`	Left-align	`"hi".ljust(6)` → "hi    "
	`rjust()`	Right-align	`"hi".rjust(6)` → "    hi"
	`zfill()`	Zero-pad on the left	`"42".zfill(5)` → "00042"
Encode/decode	`encode()`	str→bytes	`"你好".encode("utf-8")`
	`decode()`	bytes→str	`b'\xe4...'.decode("utf-8")`

7.2 re Module API Quick Reference

API	Purpose	Returns
`re.search(pattern, string)`	Find the first match	Match or None
`re.match(pattern, string)`	Match at the start of the string	Match or None
`re.fullmatch(pattern, string)`	Match the entire string	Match or None
`re.findall(pattern, string)`	Find all matches	List
`re.finditer(pattern, string)`	Iterator version of findall	Iterator
`re.sub(pattern, repl, string)`	Substitute	New string
`re.split(pattern, string)`	Split by pattern	List
`re.compile(pattern)`	Precompile a pattern	Pattern object

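7.3 Key Takeaways

Strings are immutable: every "modifying" operation returns a new string and leaves the original unchanged
Slices are half-open: s[2:5] takes indices 2, 3, and 4, but not 5
Prefer f-strings: readable, fast, and feature-rich (3.6+)
Raw strings r"": the standard choice for regular expressions and Windows paths, avoiding escape hell
bytes vs str: use bytes for network I/O and str for text processing, with encode/decode as the bridge
Regex VERBOSE: use re.VERBOSE for complex patterns; readability matters more than brevity
Non-greedy matching: .*? is handy for quick extraction from simple HTML/XML snippets; for arbitrary or nested markup, a real parser is more robust
Precompilation: re.compile() names a pattern that is reused often and skips repeated cache lookups; the speed-up is usually modest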
