ExNlp.Ranking.Bm25 (ex_nlp v0.1.0)

BM25 ranking algorithm implementation.

BM25 (Best Matching 25, also known as Okapi BM25) is a ranking function that scores documents by their relevance to a search query. It combines term frequency, inverse document frequency (IDF), and document length normalization, and is widely used for text-based information retrieval.

This implementation integrates with ExNlp's tokenizers and stemmers for flexible text processing.

Examples

iex> documents = [
...>   "BM25 is a ranking function",
...>   "used by search engines",
...>   "to rank matching documents"
...> ]
iex> query = ["ranking", "search", "function"]
iex> ExNlp.Ranking.Bm25.score(documents, query)
[1.8455076734299591, 1.0126973514850315, 0.0]

# With stemming
iex> documents = ["BM25 is a ranking function", "used by search engines", "to rank matching documents"]
...> query = ["ranking", "search"]
...> ExNlp.Ranking.Bm25.score(documents, query, stem: true, language: :english)
[0.4421744669877645, 1.0126973514850315, 0.48527450528621086]

# Score a single document
iex> documents = ["BM25 is a ranking function", "used by search engines", "to rank matching documents"]
...> query = ["ranking", "search", "function"]
...> ExNlp.Ranking.Bm25.score_document("BM25 is a ranking function", query, documents)
1.304211142369371

# With options
iex> documents = ["BM25 is a ranking function", "used by search engines"]
...> query = ["ranking", "search"]
...> ExNlp.Ranking.Bm25.score_document("BM25 is a ranking function", query, documents, k1: 1.5, b: 0.8)
0.4462059771320275

Reference: https://en.wikipedia.org/wiki/Okapi_BM25

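Summary

Types

options()

Options for BM25 calculation. Can be a keyword list or a Bm25Options struct.

Functions

batch_idf(terms, corpus, opts \\ [])

Calculates IDF for multiple terms in one call, using the BM25 variant of IDF.

normalization_factor(doc_length, avg_doc_length, opts \\ [])

Calculates the BM25 length normalization factor for a document.

score(documents, query, opts \\ [])

Scores documents against a query using BM25.

score_document(document, query, corpus, opts \\ [])

Scores a single document against a query.

score_document_from_tokens(document, query, corpus, opts \\ [])

Scores a single document from pre-processed tokens against a query.

score_from_tokens(documents, query, opts \\ [])

Scores documents from pre-processed token lists using BM25.

score_term(term_tf, doc_length, idf, avg_doc_length, opts \\ [])

Calculates the BM25 score for a single term in a document.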

Types

options()

@type options() :: keyword() | ExNlp.Ranking.Bm25Options.t()

Options for BM25 calculation. Can be a keyword list or a Bm25Options struct.

Keyword list options:

  • :k1 - Float. Controls term frequency saturation (default: 1.2)
  • :b - Float. Controls length normalization (default: 0.75)
  • :stem - If true, apply stemming to words (default: false)
  • :language - Language for stemming (default: :english)
  • :tokenizer - Custom tokenizer function (default: uses Tokenizer.word_tokenize/1)
  • :remove_stopwords - If true, remove stop words (default: false)
  • :stopword_language - Language for stop words (default: :english)

Alternatively, you can pass a Bm25Options struct created with Bm25Options.new/1.
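
As a rough illustration of how keyword options and the documented defaults might combine, the sketch below merges caller-supplied options over a list of defaults. The OptionsSketch module and its resolve/1 function are hypothetical and not part of ExNlp; only the option names and default values are taken from the list above.

```elixir
defmodule OptionsSketch do
  # Documented defaults for the scoring and text-processing options.
  # The :tokenizer default is omitted because it depends on ExNlp internals.
  @defaults [
    k1: 1.2,
    b: 0.75,
    stem: false,
    language: :english,
    remove_stopwords: false,
    stopword_language: :english
  ]

  # Hypothetical helper: caller-supplied values override the defaults.
  def resolve(opts) when is_list(opts) do
    Keyword.merge(@defaults, opts)
  end
end

resolved = OptionsSketch.resolve(k1: 1.5, stem: true)

# :k1 and :stem come from the caller, :b falls back to its default.
IO.inspect(Keyword.take(resolved, [:k1, :b, :stem]))
```

Any option left out of the keyword list keeps its default, so passing only k1: 1.5 still leaves b at 0.75.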

Functions

batch_idf(terms, corpus, opts \\ [])

@spec batch_idf([String.t()], [[String.t()]], options()) :: %{
  required(String.t()) => float()
}

Calculates IDF for multiple terms in one call, using the BM25 variant of IDF.

More efficient than calling inverse_document_frequency/4 repeatedly when you need IDF values for many terms.

Arguments

  • terms - List of terms to calculate IDF for
  • corpus - List of tokenized documents
  • opts - Options (currently unused, reserved for future extensions)

Examples

iex> terms = ["search", "engine", "ranking"]
iex> corpus = [["search", "engine"], ["engine", "ranking"], ["search"]]
iex> idf_map = ExNlp.Ranking.Bm25.batch_idf(terms, corpus)
iex> Map.has_key?(idf_map, "search")
true
iex> Map.has_key?(idf_map, "engine")
true
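
The IDF variant is not spelled out here. The following sketch assumes the common smoothed BM25 form, ln((N - n + 0.5) / (n + 0.5) + 1), where N is the number of documents and n the number of documents containing the term; this form appears consistent with the score/3 examples on this page. IdfSketch is a hypothetical module, not ExNlp's implementation.

```elixir
defmodule IdfSketch do
  # Hypothetical helper, not part of ExNlp.
  # Assumes the smoothed BM25 IDF: ln((N - n + 0.5) / (n + 0.5) + 1),
  # where N is the corpus size and n the number of documents containing the term.
  def batch_idf(terms, corpus) do
    total = length(corpus)

    # Count, in a single pass, how many documents contain each term.
    doc_freqs =
      corpus
      |> Enum.flat_map(&Enum.uniq/1)
      |> Enum.frequencies()

    Map.new(terms, fn term ->
      n = Map.get(doc_freqs, term, 0)
      {term, :math.log((total - n + 0.5) / (n + 0.5) + 1)}
    end)
  end
end

corpus = [["search", "engine"], ["engine", "ranking"], ["search"]]
idf = IdfSketch.batch_idf(["search", "engine", "ranking"], corpus)

# "ranking" appears in fewer documents than "search", so its IDF is higher.
IO.inspect(idf["ranking"] > idf["search"])
```

Computing document frequencies once and reusing them for every term is what makes a batch call cheaper than repeated single-term lookups.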

normalization_factor(doc_length, avg_doc_length, opts \\ [])

@spec normalization_factor(integer(), float(), options()) :: float()

Calculates the BM25 length normalization factor for a document.

Returns the length normalization component of the BM25 formula. Useful for pre-computing per-document normalization factors.

Arguments

  • doc_length - Length of document in tokens
  • avg_doc_length - Average document length in corpus
  • opts - BM25 parameters (only :b is used)

Examples

iex> ExNlp.Ranking.Bm25.normalization_factor(10, 15.0, b: 0.75)
0.75
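
The documented result is consistent with the standard BM25 length normalization, 1 - b + b * (doc_length / avg_doc_length). A minimal sketch under that assumption follows; NormSketch is a hypothetical module and takes b as a plain argument rather than an options list.

```elixir
defmodule NormSketch do
  # Hypothetical helper, not part of ExNlp.
  # Assumes the standard BM25 length normalization: 1 - b + b * (dl / avgdl).
  def normalization_factor(doc_length, avg_doc_length, b \\ 0.75) do
    1 - b + b * (doc_length / avg_doc_length)
  end
end

# A document shorter than average gets a factor below 1.0.
IO.inspect(NormSketch.normalization_factor(10, 15.0))

# A document of exactly average length gets a factor of 1.0.
IO.inspect(NormSketch.normalization_factor(15, 15.0))

# With b = 0.0, length has no effect at all.
IO.inspect(NormSketch.normalization_factor(100, 15.0, 0.0))
```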

score(documents, query, opts \\ [])

@spec score([String.t()] | [[String.t()]], [String.t()] | String.t(), options()) :: [
  float()
]

Scores documents against a query using BM25.

Returns a list of scores, one for each document. Higher scores indicate greater relevance.

Arguments

  • documents - List of document texts (strings) or pre-tokenized lists
  • query - Query as a list of keywords or a single string
  • opts - Options (see options/0)

Examples

iex> documents = [
...>   "BM25 is a ranking function",
...>   "used by search engines",
...>   "to rank matching documents"
...> ]
iex> query = ["ranking", "search", "function"]
iex> ExNlp.Ranking.Bm25.score(documents, query)
[1.8455076734299591, 1.0126973514850315, 0.0]

# Query as a single string
iex> documents = ["BM25 is a ranking function", "used by search engines", "to rank matching documents"]
...> ExNlp.Ranking.Bm25.score(documents, "ranking search function")
[1.8455076734299591, 1.0126973514850315, 0.0]

# Custom k1 and b parameters
iex> documents = ["BM25 is a ranking function", "used by search engines", "to rank matching documents"]
...> query = ["ranking", "search", "function"]
...> ExNlp.Ranking.Bm25.score(documents, query, k1: 1.5, b: 0.8)
[1.8267593537467681, 1.0184329304434858, 0.0]
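
Because the scores come back in document order, ranking is a matter of pairing each score with its document and sorting. The sketch below uses only the standard Enum module and hard-codes the scores from the first example so that it runs without ExNlp; dropping zero-score documents is an illustrative choice, not library behavior.

```elixir
documents = [
  "BM25 is a ranking function",
  "used by search engines",
  "to rank matching documents"
]

# Scores copied from the first example above; in practice they would come
# from ExNlp.Ranking.Bm25.score(documents, query).
scores = [1.8455076734299591, 1.0126973514850315, 0.0]

ranked =
  documents
  |> Enum.zip(scores)
  # Highest score first.
  |> Enum.sort_by(fn {_doc, score} -> score end, :desc)
  # Drop documents that matched no query term.
  |> Enum.reject(fn {_doc, score} -> score == 0.0 end)

IO.inspect(ranked)
```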

score_document(document, query, corpus, opts \\ [])

@spec score_document(
  String.t() | [String.t()],
  [String.t()] | String.t(),
  [String.t()] | [[String.t()]],
  options()
) :: float()

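Scores a single document against a query, using the corpus for corpus-level statistics such as IDF.

Arguments

  • document - Document text (string) or pre-tokenized list
  • query - Query as a list of keywords or a single string
  • corpus - List of document texts (strings) or pre-tokenized lists
  • opts - Options (see options/0)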

Examples

iex> documents = ["BM25 is a ranking function", "used by search engines"]
iex> query = ["ranking", "search"]
iex> ExNlp.Ranking.Bm25.score_document("BM25 is a ranking function", query, documents)
0.4495686888437472

# With options
iex> documents = ["BM25 is a ranking function", "used by search engines"]
...> query = ["ranking", "search"]
...> ExNlp.Ranking.Bm25.score_document("BM25 is a ranking function", query, documents, k1: 1.5, b: 0.8)
0.4462059771320275

score_document_from_tokens(document, query, corpus, opts \\ [])

@spec score_document_from_tokens(
  [String.t()],
  [String.t()],
  [[String.t()]],
  options()
) :: float()

Scores a single document from pre-processed tokens against a query.

Arguments

  • document - Pre-processed token list
  • query - Pre-processed token list
  • corpus - List of pre-processed token lists (for IDF calculation)
  • opts - Options (only :k1 and :b are used)

Examples

iex> processed_doc = ["bm25", "ranking", "function"]
iex> processed_query = ["ranking", "search"]
iex> processed_corpus = [["bm25", "ranking", "function"], ["search", "engines"]]
iex> ExNlp.Ranking.Bm25.score_document_from_tokens(processed_doc, processed_query, processed_corpus)
0.4344571362775708

score_from_tokens(documents, query, opts \\ [])

@spec score_from_tokens([[String.t()]], [String.t()], options()) :: [float()]

Scores documents from pre-processed token lists using BM25.

This is more efficient when you already have tokenized and processed documents, as it skips the tokenization and processing steps.

Arguments

  • documents - List of pre-processed token lists
  • query - Pre-processed token list
  • opts - Options (only :k1 and :b are used; other processing options are ignored)

Examples

iex> processed_docs = [["bm25", "ranking", "function"], ["search", "engines"]]
iex> processed_query = ["ranking", "search"]
iex> ExNlp.Ranking.Bm25.score_from_tokens(processed_docs, processed_query)
[0.64072428455121, 0.7549127709068711]

# With custom k1 and b
iex> processed_docs = [["bm25", "ranking", "function"], ["search", "engines"]]
...> processed_query = ["ranking", "search"]
...> ExNlp.Ranking.Bm25.score_from_tokens(processed_docs, processed_query, k1: 1.5, b: 0.8)
[0.6324335589050596, 0.7667557307079041]
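
To make the computation concrete, here is a self-contained sketch of BM25 scoring over token lists. It is a hypothetical re-implementation (Bm25Sketch is not ExNlp's source) and assumes the smoothed IDF ln((N - n + 0.5) / (n + 0.5) + 1) together with the standard k1 saturation and b length normalization. It also assumes a non-empty document list.

```elixir
defmodule Bm25Sketch do
  # Hypothetical re-implementation for illustration; not ExNlp's source.
  # Assumes IDF = ln((N - n + 0.5) / (n + 0.5) + 1) and the standard
  # BM25 term score with k1 saturation and b length normalization.
  # Assumes `documents` is non-empty.
  def score_from_tokens(documents, query, opts \\ []) do
    k1 = Keyword.get(opts, :k1, 1.2)
    b = Keyword.get(opts, :b, 0.75)
    total = length(documents)
    avg_len = Enum.sum(Enum.map(documents, &length/1)) / total

    Enum.map(documents, fn doc ->
      tfs = Enum.frequencies(doc)
      norm = 1 - b + b * (length(doc) / avg_len)

      # Sum the per-term scores over all query terms.
      query
      |> Enum.map(fn term ->
        tf = Map.get(tfs, term, 0)
        n = Enum.count(documents, &(term in &1))
        idf = :math.log((total - n + 0.5) / (n + 0.5) + 1)
        idf * tf * (k1 + 1) / (tf + k1 * norm)
      end)
      |> Enum.sum()
    end)
  end
end

docs = [["bm25", "ranking", "function"], ["search", "engines"]]
query = ["ranking", "search"]

sketch_scores = Bm25Sketch.score_from_tokens(docs, query)
IO.inspect(sketch_scores)
```

Under these assumptions the sketch agrees with the first example above to several decimal places. The shorter second document scores higher because both documents match exactly one query term with the same IDF, and length normalization favors the shorter one.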

score_term(term_tf, doc_length, idf, avg_doc_length, opts \\ [])

@spec score_term(non_neg_integer(), integer(), float(), float(), options()) :: float()

Calculates the BM25 score for a single term in a document.

Useful for inverted index construction where you score terms individually. This is more efficient than scoring entire documents when you only need per-term scores.

Arguments

  • term_tf - Term frequency in document (pass pre-computed for efficiency)
  • doc_length - Length of document in tokens
  • idf - Pre-computed IDF for this term
  • avg_doc_length - Average document length in corpus
  • opts - BM25 parameters (:k1 and :b)

Examples

iex> ExNlp.Ranking.Bm25.score_term(2, 10, 1.5, 15.0, k1: 1.2, b: 0.75)
2.2758620689655173
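
The documented value matches the standard BM25 term score, idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_length / avg_doc_length)). The sketch below assumes that formula; TermScoreSketch is a hypothetical module that takes k1 and b as positional arguments rather than an options list.

```elixir
defmodule TermScoreSketch do
  # Hypothetical helper, not part of ExNlp.
  # Assumes the standard BM25 term score:
  #   idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
  def score_term(tf, doc_length, idf, avg_doc_length, k1 \\ 1.2, b \\ 0.75) do
    norm = 1 - b + b * (doc_length / avg_doc_length)
    idf * tf * (k1 + 1) / (tf + k1 * norm)
  end
end

# Term frequency saturates: doubling tf from 2 to 4 raises the score,
# but by less than a factor of two.
s2 = TermScoreSketch.score_term(2, 10, 1.5, 15.0)
s4 = TermScoreSketch.score_term(4, 10, 1.5, 15.0)
IO.inspect({s2, s4, s4 / s2})
```

When building an inverted index, the normalization term depends only on the document, so it can be computed once per document and reused for every term in it.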