ExNlp.Ranking.Bm25 (ex_nlp v0.1.0)
BM25 ranking algorithm implementation.
BM25 (Best Matching 25) is a ranking function that scores documents by their relevance to a given search query, combining term frequency, inverse document frequency (IDF), and document length normalization. It is widely used in text-based information retrieval.
This implementation integrates with ExNlp's tokenizers and stemmers for flexible text processing.
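For orientation, the standard Okapi BM25 scoring function from the reference cited in this module can be written as follows. The symbols are notation chosen here, not names from the library: f(q, D) is the frequency of query term q in document D, |D| is the document length in tokens, avgdl is the average document length in the corpus, N is the number of documents, and n(q) is the number of documents containing q. The IDF form shown is an assumption: it is the variant that reproduces the score values in the examples on this page when checked by hand.

```latex
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q) \cdot
  \frac{f(q, D)\,(k_1 + 1)}
       {f(q, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
\qquad
\mathrm{IDF}(q) = \ln\!\left(1 + \frac{N - n(q) + 0.5}{n(q) + 0.5}\right)
```

Here k1 controls how quickly repeated occurrences of a term stop adding to the score, and b controls how strongly longer-than-average documents are penalized.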
Examples
iex> documents = [
...> "BM25 is a ranking function",
...> "used by search engines",
...> "to rank matching documents"
...> ]
iex> query = ["ranking", "search", "function"]
iex> ExNlp.Ranking.Bm25.score(documents, query)
[1.8455076734299591, 1.0126973514850315, 0.0]
# With stemming
iex> documents = ["BM25 is a ranking function", "used by search engines", "to rank matching documents"]
...> query = ["ranking", "search"]
...> ExNlp.Ranking.Bm25.score(documents, query, stem: true, language: :english)
[0.4421744669877645, 1.0126973514850315, 0.48527450528621086]
# Score a single document
iex> documents = ["BM25 is a ranking function", "used by search engines", "to rank matching documents"]
...> query = ["ranking", "search", "function"]
...> ExNlp.Ranking.Bm25.score_document("BM25 is a ranking function", query, documents)
1.304211142369371
# With options
iex> documents = ["BM25 is a ranking function", "used by search engines"]
...> query = ["ranking", "search"]
...> ExNlp.Ranking.Bm25.score_document("BM25 is a ranking function", query, documents, k1: 1.5, b: 0.8)
0.4462059771320275
Reference: https://en.wikipedia.org/wiki/Okapi_BM25
Summary
Functions
Batch-calculates IDF for multiple terms using the BM25 variant.
Calculate BM25 normalization factor for a document.
Scores documents against a query using BM25.
Scores a single document against a query.
Scores a single document from pre-processed tokens against a query.
Scores documents from pre-processed token lists using BM25.
Calculate BM25 score for a single term in a document.
Types
@type options() :: keyword() | ExNlp.Ranking.Bm25Options.t()
Options for BM25 calculation. Can be a keyword list or a Bm25Options struct.
Keyword list options:
:k1 - Float. Controls term frequency saturation (default: 1.2)
:b - Float. Controls length normalization (default: 0.75)
:stem - If true, apply stemming to words (default: false)
:language - Language for stemming (default: :english)
:tokenizer - Custom tokenizer function (default: uses Tokenizer.word_tokenize/1)
:remove_stopwords - If true, remove stop words (default: false)
:stopword_language - Language for stop words (default: :english)
Alternatively, you can pass a Bm25Options struct created with Bm25Options.new/1.
Functions
Batch-calculates IDF for multiple terms using the BM25 variant.
More efficient than calling inverse_document_frequency/4 repeatedly
when you need IDF values for many terms.
Arguments
terms - List of terms to calculate IDF for
corpus - List of tokenized documents
opts - Options (currently unused, reserved for future extensions)
Examples
iex> terms = ["search", "engine", "ranking"]
iex> corpus = [["search", "engine"], ["engine", "ranking"], ["search"]]
iex> idf_map = ExNlp.Ranking.Bm25.batch_idf(terms, corpus)
iex> Map.has_key?(idf_map, "search")
true
iex> Map.has_key?(idf_map, "engine")
true
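The example above only checks which keys are present. As a worked illustration of the values one might expect, assume the IDF variant is the smoothed form below (an assumption: the page does not show the formula, but this form reproduces the score_from_tokens results documented here). N is the number of documents in the corpus and n(t) is the number of documents containing term t; both symbols are notation chosen for this sketch.

```latex
\mathrm{IDF}(t) = \ln\!\left(1 + \frac{N - n(t) + 0.5}{n(t) + 0.5}\right)

% Corpus from the example: N = 3; "search" and "engine" each occur in 2 documents, "ranking" in 1.
\mathrm{IDF}(\text{search}) = \mathrm{IDF}(\text{engine})
  = \ln\!\left(1 + \frac{1.5}{2.5}\right) = \ln 1.6 \approx 0.4700

\mathrm{IDF}(\text{ranking})
  = \ln\!\left(1 + \frac{2.5}{1.5}\right) = \ln \tfrac{8}{3} \approx 0.9808
```

Under this assumption, rarer terms receive a higher IDF, and the value stays positive even for terms that occur in every document.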
Calculate BM25 normalization factor for a document.
Returns the length normalization component of the BM25 formula. Useful for pre-computing per-document normalization factors.
Arguments
doc_length - Length of document in tokens
avg_doc_length - Average document length in corpus
opts - BM25 parameters (only b is used)
Examples
iex> ExNlp.Ranking.Bm25.normalization_factor(10, 15.0, b: 0.75)
0.75
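The result above is consistent with the standard BM25 length normalization term, worked through here for the same inputs (doc_length = 10, avg_doc_length = 15.0, b = 0.75):

```latex
\mathrm{norm} = 1 - b + b \cdot \frac{\text{doc\_length}}{\text{avg\_doc\_length}}
  = 1 - 0.75 + 0.75 \cdot \frac{10}{15}
  = 0.25 + 0.5
  = 0.75
```

A document of exactly average length gives a factor of 1 regardless of b. Shorter documents fall below 1 and longer ones rise above it.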
Scores documents against a query using BM25.
Returns a list of scores, one for each document. Higher scores indicate greater relevance.
Arguments
documents - List of document texts (strings) or pre-tokenized lists
query - Query as a list of keywords or a single string
opts - Options (see options/0)
Examples
iex> documents = [
...> "BM25 is a ranking function",
...> "used by search engines",
...> "to rank matching documents"
...> ]
iex> query = ["ranking", "search", "function"]
iex> ExNlp.Ranking.Bm25.score(documents, query)
[1.8455076734299591, 1.0126973514850315, 0.0]
# Query as a single string
iex> documents = ["BM25 is a ranking function", "used by search engines", "to rank matching documents"]
...> ExNlp.Ranking.Bm25.score(documents, "ranking search function")
[1.8455076734299591, 1.0126973514850315, 0.0]
# Custom k1 and b parameters
iex> documents = ["BM25 is a ranking function", "used by search engines", "to rank matching documents"]
...> query = ["ranking", "search", "function"]
...> ExNlp.Ranking.Bm25.score(documents, query, k1: 1.5, b: 0.8)
[1.8267593537467681, 1.0184329304434858, 0.0]
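Because the scores come back in the same order as the input documents, a common next step is to pair each document with its score and sort by relevance. The snippet below is a minimal sketch using only the standard library. The scores are copied from the first example above so that it runs without the library, and the variable name ranked is chosen for illustration.

```elixir
documents = [
  "BM25 is a ranking function",
  "used by search engines",
  "to rank matching documents"
]

# Scores copied from the example above; in real use they would come from
# ExNlp.Ranking.Bm25.score(documents, query).
scores = [1.8455076734299591, 1.0126973514850315, 0.0]

# Pair each document with its score, then sort from highest to lowest score.
ranked =
  documents
  |> Enum.zip(scores)
  |> Enum.sort_by(fn {_doc, score} -> score end, :desc)

# Print the documents from most to least relevant.
Enum.each(ranked, fn {doc, score} ->
  IO.puts("#{Float.round(score, 4)}  #{doc}")
end)
```

Sorting with :desc in Enum.sort_by/3 requires Elixir 1.10 or later.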
@spec score_document( String.t() | [String.t()], [String.t()] | String.t(), [String.t()] | [[String.t()]], options() ) :: float()
Scores a single document against a query.
Examples
iex> documents = ["BM25 is a ranking function", "used by search engines"]
iex> query = ["ranking", "search"]
iex> ExNlp.Ranking.Bm25.score_document("BM25 is a ranking function", query, documents)
0.4495686888437472
# With options
iex> documents = ["BM25 is a ranking function", "used by search engines"]
...> query = ["ranking", "search"]
...> ExNlp.Ranking.Bm25.score_document("BM25 is a ranking function", query, documents, k1: 1.5, b: 0.8)
0.4462059771320275
@spec score_document_from_tokens( [String.t()], [String.t()], [[String.t()]], options() ) :: float()
Scores a single document from pre-processed tokens against a query.
Arguments
document - Pre-processed token list
query - Pre-processed token list
corpus - List of pre-processed token lists (for IDF calculation)
opts - Options (only :k1 and :b are used)
Examples
iex> processed_doc = ["bm25", "ranking", "function"]
iex> processed_query = ["ranking", "search"]
iex> processed_corpus = [["bm25", "ranking", "function"], ["search", "engines"]]
iex> ExNlp.Ranking.Bm25.score_document_from_tokens(processed_doc, processed_query, processed_corpus)
0.4344571362775708
Scores documents from pre-processed token lists using BM25.
This is more efficient when you already have tokenized and processed documents, as it skips the tokenization and processing steps.
Arguments
documents - List of pre-processed token lists
query - Pre-processed token list
opts - Options (only :k1 and :b are used, other processing options ignored)
Examples
iex> processed_docs = [["bm25", "ranking", "function"], ["search", "engines"]]
iex> processed_query = ["ranking", "search"]
iex> ExNlp.Ranking.Bm25.score_from_tokens(processed_docs, processed_query)
[0.64072428455121, 0.7549127709068711]
# With custom k1 and b
iex> processed_docs = [["bm25", "ranking", "function"], ["search", "engines"]]
...> processed_query = ["ranking", "search"]
...> ExNlp.Ranking.Bm25.score_from_tokens(processed_docs, processed_query, k1: 1.5, b: 0.8)
[0.6324335589050596, 0.7667557307079041]
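To make the computation behind these numbers concrete, the following is a minimal plain-Elixir sketch of BM25 over token lists. It is not the library's implementation: the module name Bm25Sketch and its functions are hypothetical, and the IDF variant is an assumption chosen because it reproduces the documented values for the example corpus.

```elixir
defmodule Bm25Sketch do
  # Hypothetical module for illustration only; not part of ExNlp.

  # Assumed IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5)),
  # where N is the corpus size and n the number of documents containing the term.
  def idf(term, corpus) do
    n_docs = length(corpus)
    n_containing = Enum.count(corpus, fn doc -> term in doc end)
    :math.log(1 + (n_docs - n_containing + 0.5) / (n_containing + 0.5))
  end

  # Scores every document against the query tokens.
  # Assumes a non-empty list of non-empty token lists.
  def score_from_tokens(docs, query, k1 \\ 1.2, b \\ 0.75) do
    avg_len = Enum.sum(Enum.map(docs, &length/1)) / length(docs)

    Enum.map(docs, fn doc ->
      # Length normalization for this document.
      norm = 1 - b + b * length(doc) / avg_len

      query
      |> Enum.map(fn term ->
        tf = Enum.count(doc, &(&1 == term))
        idf(term, docs) * tf * (k1 + 1) / (tf + k1 * norm)
      end)
      |> Enum.sum()
    end)
  end
end

docs = [["bm25", "ranking", "function"], ["search", "engines"]]
Bm25Sketch.score_from_tokens(docs, ["ranking", "search"])
```

With the default parameters this yields roughly 0.6407 and 0.7549, in line with the first example above up to floating-point rounding. The shorter second document scores higher even though each document matches exactly one query term, which shows the effect of length normalization.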
Calculate BM25 score for a single term in a document.
Useful for inverted index construction where you score terms individually. This is more efficient than scoring entire documents when you only need per-term scores.
Arguments
term_tf - Term frequency in document (pass pre-computed for efficiency)
doc_length - Length of document in tokens
idf - Pre-computed IDF for this term
avg_doc_length - Average document length in corpus
opts - BM25 parameters (k1, b)
Examples
iex> ExNlp.Ranking.Bm25.score_term(2, 10, 1.5, 15.0, k1: 1.2, b: 0.75)
2.2758620689655173
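The value above can be checked by hand with the standard per-term BM25 expression, using the same inputs (term_tf = 2, doc_length = 10, idf = 1.5, avg_doc_length = 15.0, k1 = 1.2, b = 0.75):

```latex
\mathrm{norm} = 1 - 0.75 + 0.75 \cdot \frac{10}{15} = 0.75

\mathrm{score} = \mathrm{idf} \cdot
  \frac{\text{term\_tf} \cdot (k_1 + 1)}{\text{term\_tf} + k_1 \cdot \mathrm{norm}}
  = 1.5 \cdot \frac{2 \cdot 2.2}{2 + 1.2 \cdot 0.75}
  = 1.5 \cdot \frac{4.4}{2.9}
  \approx 2.2759
```

Summing this quantity over the query terms gives the score of the whole document.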