
What Python's sentence_transformers Library Does (Generating High-Quality Embeddings for Sentences, Paragraphs, and Images)

Author: 彬彬侠

sentence_transformers is a Python library built on Hugging Face Transformers for generating high-quality embeddings of text and images. It supports multilingual and cross-modal use cases as well as NLP tasks such as semantic search. This article explains what the library does and how to use it.

sentence_transformers is a Python library for generating high-quality embeddings of sentences, paragraphs, or images, built on Hugging Face's transformers library. Using pretrained Transformer models (such as BERT, RoBERTa, or DistilBERT), it produces fixed-length dense vector representations that are widely used in natural language processing (NLP) tasks such as semantic search, text-similarity comparison, clustering, and paraphrase mining. The sections below describe the library in detail.

1. What the sentence_transformers library does

2. Installation and environment requirements

According to the official documentation, a recent Python release (3.9 or later) together with current PyTorch and transformers versions is recommended; see the docs for the exact supported versions.

Note: for GPU acceleration, install a PyTorch build that matches your system's CUDA version; see the official PyTorch installation guide.
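Installation itself is a single pip command (a minimal setup sketch; GPU-specific PyTorch builds are installed separately as noted above):

```shell
# Installs the library plus its core dependencies (torch, transformers)
pip install -U sentence-transformers
```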

3. Core features and usage

sentence_transformers provides two main model types: SentenceTransformer (for generating embeddings) and CrossEncoder (for reranking or similarity scoring).

3.1 SentenceTransformer: generating embeddings

SentenceTransformer encodes text into fixed-length vectors, well suited to tasks such as semantic search and similarity comparison.

Basic usage

from sentence_transformers import SentenceTransformer
# Load a pretrained model
model = SentenceTransformer("all-MiniLM-L6-v2")
# Sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium."
]
# Generate the embeddings
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 384): 3 sentences, 384 dimensions each
# Compute pairwise similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])


Advanced usage with prompts
Some models support adding a prompt at inference time to improve performance, for example when embedding queries:

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
query_embedding = model.encode("What are Pandas?", prompt_name="query")
document_embeddings = model.encode([
    "Pandas is a software library written for the Python programming language for data manipulation and analysis.",
    "Pandas are a species of bear native to South Central China."
])
similarity = model.similarity(query_embedding, document_embeddings)
print(similarity)  # tensor([[0.7594, 0.7560]])

3.2 CrossEncoder: reranking

A CrossEncoder scores sentence pairs directly, which suits tasks that demand high precision (such as reranking retrieval results).

Usage

from sentence_transformers import CrossEncoder
# Load a pretrained CrossEncoder model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# Query and candidate passages
query = "Which planet is known as the Red Planet?"
passages = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot."
]
# Score each (query, passage) pair
scores = model.predict([(query, passage) for passage in passages])
print(scores)  # e.g. [0.0123, 0.9987, 0.3456] (example scores)


4. Pretrained models

sentence_transformers gives access to more than 10,000 pretrained models hosted on the Hugging Face Hub, covering many languages and task types.

Loading a model

model = SentenceTransformer("all-MiniLM-L6-v2")  # downloads from Hugging Face

Loading locally
For offline use, download and save the model first:

model = SentenceTransformer("all-MiniLM-L6-v2")
model.save("local_model_path")
# Load the local copy
model = SentenceTransformer("local_model_path")

5. Fine-tuning and training

sentence_transformers supports fine-tuning models for specific tasks and provides a range of loss functions and training options.

5.1 Training example

The following example fine-tunes a model for a semantic textual similarity task:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
# Load the base model
model = SentenceTransformer("all-MiniLM-L6-v2")
# Prepare training data: sentence pairs labeled with a similarity in [0, 1]
train_examples = [
    InputExample(texts=["The weather is nice today.", "It's pleasant outside."], label=0.9),
    InputExample(texts=["The weather is nice today.", "He went to the park."], label=0.2)
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
# Define the loss function
train_loss = losses.CosineSimilarityLoss(model)
# Train
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100
)
# Save the fine-tuned model
model.save("finetuned_model")


5.2 Supported loss functions

sentence_transformers.losses provides many loss functions, including CosineSimilarityLoss, MultipleNegativesRankingLoss, TripletLoss, and ContrastiveLoss.

5.3 Training tips

6. Performance optimization
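One widely used optimization is to normalize embeddings once at encode time (for example with `model.encode(..., normalize_embeddings=True)`), after which cosine similarity reduces to a plain matrix product. The sketch below illustrates the idea with random NumPy vectors standing in for real model output:

```python
import numpy as np

# Simulated embeddings stand in for model.encode(...) output
# (3 sentences x 4 dimensions; real models produce e.g. 384 dims).
rng = np.random.default_rng(0)
emb = rng.normal(size=(3, 4))

# Normalize each row to unit length once, up front.
norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Cosine similarity is now just a matrix product -- no per-pair norms.
sims = norm @ norm.T

# The diagonal is 1.0: every vector has cosine similarity 1 with itself.
print(np.allclose(np.diag(sims), 1.0))  # True
```

Precomputing the normalization this way pays off when the same corpus embeddings are compared against many queries.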

7. Typical applications
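The most common application is semantic search. As an illustrative sketch (random NumPy vectors stand in for `model.encode` output), ranking a corpus against a query is a cosine-score computation followed by an argsort:

```python
import numpy as np

rng = np.random.default_rng(42)
corpus_emb = rng.normal(size=(5, 8))   # 5 documents, 8-dim embeddings
query_emb = rng.normal(size=(8,))      # 1 query


def cos(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))


# Score the query against every document.
scores = np.array([cos(query_emb, d) for d in corpus_emb])

# Indices of the top-2 documents, best first.
top_k = np.argsort(-scores)[:2]
print(top_k, scores[top_k])
```

In real use the vectors come from a SentenceTransformer, and for large corpora the scoring step is delegated to a vector index such as FAISS (see the LangChain section below).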

8. Integration with LangChain

sentence_transformers is often combined with LangChain to build semantics-driven applications (such as chatbots and document retrieval):

from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
# Initialize the embedding model (wraps a sentence_transformers model)
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# Build the vector store
texts = ["The weather is nice today.", "It's sunny outside."]
vector_store = FAISS.from_texts(texts, embeddings)
# Semantic search
query = "How's the weather?"
results = vector_store.similarity_search(query, k=2)
print(results)

9. Caveats

10. A comprehensive example

The following example combines embedding generation, similarity computation, and CrossEncoder reranking:

from sentence_transformers import SentenceTransformer, CrossEncoder, util
# Load the models
encoder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# Corpus and query
query = "What is pandas?"
corpus = [
    "Pandas is a Python library for data analysis.",
    "Pandas are animals native to China.",
    "NumPy is a Python library for numerical computing."
]
# Generate embeddings and compute cosine similarities
query_embedding = encoder.encode(query)
corpus_embeddings = encoder.encode(corpus)
similarities = util.cos_sim(query_embedding, corpus_embeddings)
# Initial retrieval: top-k candidates by bi-encoder score
top_k = 3
hits = [{"corpus_id": i, "score": similarities[0][i]} for i in range(len(corpus))]
hits = sorted(hits, key=lambda x: x["score"], reverse=True)[:top_k]
# Rerank the candidates with the CrossEncoder
pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
scores = reranker.predict(pairs)
# Print the results in reranked order, best first
for score, hit in sorted(zip(scores, hits), key=lambda x: x[0], reverse=True):
    print(f"Score: {score:.4f}, Text: {corpus[hit['corpus_id']]}")

Example output

Score: 0.9987, Text: Pandas is a Python library for data analysis.
Score: 0.3456, Text: Pandas are animals native to China.
Score: 0.0123, Text: NumPy is a Python library for numerical computing.

11. Resources and documentation

Official documentation: https://www.sbert.net
Source code: https://github.com/UKPLab/sentence-transformers
Pretrained models: https://huggingface.co/models?library=sentence-transformers
