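Semantic text similarity and text-matching search in Python with similarities
Author: 小龙在山东
similarities implements a range of similarity-computation and matching-search algorithms. It supports both text and images and is written in Python 3.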
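As background for the scores that appear in the example later in this article: the sample code describes its first step as computing the cosine similarity between two sentences. The sketch below shows that measure on plain Python lists. The toy vectors are made up for illustration; in the real example the vectors would be sentence embeddings produced by the model, and this is not the library's own implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (hypothetical data, not real embeddings).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(f"same direction: {same_direction:.4f}")
print(f"orthogonal:     {orthogonal:.4f}")
```

Vectors pointing the same way score close to 1 and unrelated (orthogonal) vectors score 0, which is why near-duplicate sentences in the example output score much higher than unrelated ones.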
Installation
pip3 install torch # conda install pytorch
pip3 install -U similarities
or
git clone https://github.com/shibing624/similarities.git
cd similarities
python3 setup.py install
Errors
ChineseCLIPProcessor
Traceback (most recent call last):
  File "xx\similarity_test1.py", line 9, in <module>
    from similarities import BertSimilarity
  File "xx\lib\site-packages\similarities\__init__.py", line 28, in <module>
    from similarities.clip_similarity import ClipSimilarity
  File "xx\lib\site-packages\similarities\clip_similarity.py", line 16, in <module>
    from similarities.clip_module import ClipModule
  File "xx\lib\site-packages\similarities\clip_module.py", line 18, in <module>
    from transformers import ChineseCLIPProcessor, ChineseCLIPModel, CLIPProcessor, CLIPModel
ImportError: cannot import name 'ChineseCLIPProcessor' from 'transformers' (xx\lib\site-packages\transformers\__init__.py)
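This error occurs because the installed transformers version is too old to provide ChineseCLIPProcessor. Upgrading transformers resolves it.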
pip install --upgrade transformers
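To confirm that the upgrade took effect, it can help to test whether the name from the traceback is now importable. The helper below is a minimal sketch; `can_import_name` is a hypothetical helper written for this article, not part of transformers or similarities. It catches any exception on purpose, on the assumption that a broken install may fail in ways other than a plain ImportError.

```python
import importlib


def can_import_name(module_name, attr_name):
    """Return True if `from module_name import attr_name` looks like it would work."""
    try:
        module = importlib.import_module(module_name)
        getattr(module, attr_name)
    except Exception:
        # Missing package, missing name, or a failure while loading the name.
        return False
    return True


# The name that triggered the ImportError in the traceback above.
print("ChineseCLIPProcessor available:",
      can_import_name("transformers", "ChineseCLIPProcessor"))
```

If this prints False after the upgrade, the interpreter running the script is probably not the one pip installed into.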
pydantic
In addition, pydantic was missing from the environment:
pip install pydantic
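Both problems above come down to missing or outdated packages, so a quick check before running the example can save a round trip. The following is a rough sketch using only the standard library; `check_dependency` is a hypothetical helper, and the package list simply collects the names mentioned in this article.

```python
from importlib import metadata, util


def check_dependency(name):
    """Return (is_importable, installed_version_or_None) for a package name."""
    importable = util.find_spec(name) is not None
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        version = None
    return importable, version


# Packages mentioned in the installation and error sections of this article.
for package in ("torch", "transformers", "pydantic", "similarities"):
    importable, version = check_dependency(package)
    status = "ok" if importable else "MISSING"
    print(f"{package:<14} {status:<8} version: {version or 'unknown'}")
```

The check only locates each package without importing it, so it stays fast even for something as heavy as torch.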
Example
# -*- coding: utf-8 -*-
"""
@author:XuMing(xuming624@qq.com)
@description: Semantic text similarity computation and text-matching search
"""
import sys

sys.path.append('..')
from similarities import BertSimilarity

# 1.Compute cosine similarity between two sentences.
sentences = ['如何更换花呗绑定银行卡',
             '花呗更改绑定银行卡']
corpus = [
    '花呗更改绑定银行卡',
    '我什么时候开通了花呗',
    '俄罗斯警告乌克兰反对欧盟协议',
    '暴风雨掩埋了东北部;新泽西16英寸的降雪',
    '中央情报局局长访问以色列叙利亚会谈',
    '人在巴基斯坦基地的炸弹袭击中丧生',
]
model = BertSimilarity(model_name_or_path="shibing624/text2vec-base-chinese")
print(model)
similarity_score = model.similarity(sentences[0], sentences[1])
print(f"{sentences[0]} vs {sentences[1]}, score: {float(similarity_score):.4f}")

print('-' * 50 + '\n')
# 2.Compute similarity between two list
similarity_scores = model.similarity(sentences, corpus)
print(similarity_scores.numpy())
for i in range(len(sentences)):
    for j in range(len(corpus)):
        print(f"{sentences[i]} vs {corpus[j]}, score: {similarity_scores.numpy()[i][j]:.4f}")

print('-' * 50 + '\n')
# 3.Semantic Search
model.add_corpus(corpus)
res = model.most_similar(queries=sentences, topn=3)
print(res)
for q_id, id_score_dict in res.items():
    print('query:', sentences[q_id])
    print("search top 3:")
    for corpus_id, s in id_score_dict.items():
        print(f'\t{model.corpus[corpus_id]}: {s:.4f}')

print('-' * 50 + '\n')
print(model.search(sentences[0], topn=3))
Output:
Similarity: BertSimilarity, matching_model: <SentenceModel: shibing624/text2vec-base-chinese, encoder_type: MEAN, max_seq_length: 256, emb_dim: 768>
2024-03-07 20:12:46.481 | DEBUG | text2vec.sentence_model:__init__:80 - Use device: cpu
如何更换花呗绑定银行卡 vs 花呗更改绑定银行卡, score: 0.8551
--------------------------------------------------
[[0.8551465 0.72119546 0.14502521 0.21666759 0.25171342 0.08089039]
[0.9999997 0.6807433 0.17136583 0.21621695 0.27282682 0.12791349]]
如何更换花呗绑定银行卡 vs 花呗更改绑定银行卡, score: 0.8551
如何更换花呗绑定银行卡 vs 我什么时候开通了花呗, score: 0.7212
如何更换花呗绑定银行卡 vs 俄罗斯警告乌克兰反对欧盟协议, score: 0.1450
如何更换花呗绑定银行卡 vs 暴风雨掩埋了东北部;新泽西16英寸的降雪, score: 0.2167
如何更换花呗绑定银行卡 vs 中央情报局局长访问以色列叙利亚会谈, score: 0.2517
如何更换花呗绑定银行卡 vs 人在巴基斯坦基地的炸弹袭击中丧生, score: 0.0809
花呗更改绑定银行卡 vs 花呗更改绑定银行卡, score: 1.0000
花呗更改绑定银行卡 vs 我什么时候开通了花呗, score: 0.6807
花呗更改绑定银行卡 vs 俄罗斯警告乌克兰反对欧盟协议, score: 0.1714
花呗更改绑定银行卡 vs 暴风雨掩埋了东北部;新泽西16英寸的降雪, score: 0.2162
花呗更改绑定银行卡 vs 中央情报局局长访问以色列叙利亚会谈, score: 0.2728
花呗更改绑定银行卡 vs 人在巴基斯坦基地的炸弹袭击中丧生, score: 0.1279
--------------------------------------------------
2024-03-07 20:13:03.429 | INFO | similarities.bert_similarity:add_corpus:108 - Start computing corpus embeddings, new docs: 6
Batches: 100%|██████████| 1/1 [00:10<00:00, 10.45s/it]
2024-03-07 20:13:13.889 | INFO | similarities.bert_similarity:add_corpus:120 - Add 6 docs, total: 6, emb len: 6
{0: {0: 0.8551465272903442, 1: 0.7211954593658447, 4: 0.25171342492103577}, 1: {0: 0.9999997019767761, 1: 0.6807432770729065, 4: 0.27282682061195374}}
query: 如何更换花呗绑定银行卡
search top 3:
    花呗更改绑定银行卡: 0.8551
    我什么时候开通了花呗: 0.7212
    中央情报局局长访问以色列叙利亚会谈: 0.2517
query: 花呗更改绑定银行卡
search top 3:
    花呗更改绑定银行卡: 1.0000
    我什么时候开通了花呗: 0.6807
    中央情报局局长访问以色列叙利亚会谈: 0.2728
--------------------------------------------------
{0: {0: 0.8551465272903442, 1: 0.7211954593658447, 4: 0.25171342492103577}}
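The search result maps each query index to a dictionary of corpus index and score, ordered from best to worst. To make that structure easier to read, the sketch below rebuilds the top-3 entry for the first query from the first row of the similarity matrix printed earlier in the output. It is a plain-Python illustration of ranking by score; `top_n` is a hypothetical helper, and how the library performs the search internally is not shown in this article.

```python
def top_n(scores, n):
    """Return {corpus_index: score} for the n highest scores, best first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return dict(ranked[:n])


# Scores of the first query against the six corpus sentences,
# copied from the first row of the similarity matrix in the output above.
row = [0.8551465, 0.72119546, 0.14502521, 0.21666759, 0.25171342, 0.08089039]
result = top_n(row, 3)
print(result)
```

The selected corpus indices are 0, 1 and 4, the same three that appear for query 0 in the search output above.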