bm25s

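BM25S Project Introduction

What is BM25S?

BM25S is a very fast BM25 library implemented in pure Python. It relies on SciPy sparse matrices for performance. BM25 is a widely used ranking function for text retrieval tasks and a core component of search services such as Elasticsearch.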
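
To make the ranking function concrete, here is a minimal sketch of BM25 scoring using only the standard library. It assumes one common variant (Lucene-style IDF and term-frequency saturation); the function name `bm25_scores` and the defaults `k1=1.5` and `b=0.75` are illustrative choices, not a description of how BM25S is implemented internally.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the tokenized query (hypothetical helper)."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each token.
    doc_freq = Counter(tok for doc in corpus_tokens for tok in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for tok in query_tokens:
            tf = term_freq[tok]
            if tf == 0:
                continue  # tokens absent from the document contribute nothing
            # Lucene-style IDF (assumed variant); it never goes negative.
            idf = math.log(1 + (n_docs - doc_freq[tok] + 0.5) / (doc_freq[tok] + 0.5))
            # Term-frequency saturation with document-length normalization.
            score += idf * tf / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores

docs = [
    "a cat is a feline and likes to purr".split(),
    "a dog is the human's best friend and loves to play".split(),
]
print(bm25_scores(["cat", "purr"], docs))
```

Only the first document contains the query tokens, so it receives a positive score while the second scores zero.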

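BM25S Features

BM25S is known for two main features:

Fast: BM25S precomputes a score for every token in every document at indexing time and stores the results in SciPy sparse matrices, so scoring at query time is very fast.
Simple: BM25S installs with pip and is ready to use right away. It has no dependency on Java or PyTorch; it needs only SciPy and NumPy, plus an optional lightweight dependency for stemming.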
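
The precomputation idea can be sketched as follows. This is a dependency-free illustration that swaps the SciPy sparse matrix for a nested dictionary (token to document ID to score), where missing pairs act as implicit zeros. The names `build_index` and `query`, the scoring variant, and the parameter defaults are all assumptions made for the example, not BM25S internals.

```python
import math
from collections import Counter, defaultdict

def build_index(corpus_tokens, k1=1.5, b=0.75):
    """Precompute one score per (token, document) pair that actually occurs."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    doc_freq = Counter(tok for doc in corpus_tokens for tok in set(doc))

    # token -> {doc_id: score}; absent pairs are implicit zeros, like a sparse matrix.
    index = defaultdict(dict)
    for doc_id, doc in enumerate(corpus_tokens):
        for tok, tf in Counter(doc).items():
            # Lucene-style IDF and saturation (assumed variant for this sketch).
            idf = math.log(1 + (n_docs - doc_freq[tok] + 0.5) / (doc_freq[tok] + 0.5))
            index[tok][doc_id] = idf * tf / (tf + k1 * (1 - b + b * len(doc) / avg_len))
    return dict(index), n_docs

def query(index, n_docs, query_tokens):
    """Sum stored scores per document; no BM25 arithmetic happens at query time."""
    scores = [0.0] * n_docs
    for tok in query_tokens:
        for doc_id, stored in index.get(tok, {}).items():
            scores[doc_id] += stored
    return scores

docs = [["cat", "feline", "purr"], ["dog", "friend", "play"], ["fish", "water", "swim"]]
index, n_docs = build_index(docs)
print(query(index, n_docs, ["fish", "purr"]))
```

All the BM25 arithmetic happens once in `build_index`; answering a query is reduced to lookups and additions, which is the trade-off the "Fast" point describes.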

Installation and Quick Start

Installing BM25S

Install BM25S with the following command:

pip install bm25s

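For better retrieval results, users can also install the optional dependencies. The first command below installs all extras, PyStemmer provides stemming, and jax is an optional package that the upstream project uses to speed up top-k selection rather than stemming. The quotes keep shells such as zsh from interpreting the square brackets.

pip install "bm25s[full]"
pip install PyStemmer
pip install "jax[cpu]"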
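
After installing, it can be useful to confirm which of these packages are actually available in the current environment. The following is a small sketch using only the standard library; the helper name `installed_packages` and the `OPTIONAL_MODULES` mapping are hypothetical, and the import name `Stemmer` for PyStemmer is taken from the quick start example.

```python
from importlib.util import find_spec

# Package name -> import name. "Stemmer" is the module that PyStemmer provides.
OPTIONAL_MODULES = {"bm25s": "bm25s", "PyStemmer": "Stemmer", "jax": "jax"}

def installed_packages(modules=OPTIONAL_MODULES):
    """Return {package_name: bool} without importing the packages themselves."""
    return {pkg: find_spec(mod) is not None for pkg, mod in modules.items()}

for pkg, ok in installed_packages().items():
    print(f"{pkg}: {'installed' if ok else 'missing'}")
```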

Quick Start

Here is a short example showing how to use BM25S to retrieve text:

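import bm25s
import Stemmer  # optional: provided by PyStemmer, used for stemming

# Create the corpus
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

# Tokenize the corpus, removing English stopwords and stemming each token
stemmer = Stemmer.Stemmer("english")
corpus_tokens = bm25s.tokenize(corpus, stopwords="en", stemmer=stemmer)

# Create the BM25 model and index the tokenized corpus
retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# Tokenize the query with the same stemmer as the corpus
query = "does the fish purr like a cat?"
query_tokens = bm25s.tokenize(query, stemmer=stemmer)

# Get the top-k documents; both arrays have shape (n_queries, k)
results, scores = retriever.retrieve(query_tokens, corpus=corpus, k=2)

# Print the ranked results for the single query (row 0)
for i in range(results.shape[1]):
    doc, score = results[0, i], scores[0, i]
    print(f"Rank {i+1} (score: {score:.2f}): {doc}")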