ExNlp.Ranking (ex_nlp v0.1.0)

Unified API for ranking algorithms.

This module provides access to various ranking algorithms for information retrieval and text mining, including TF-IDF and BM25.

Overview

Both TF-IDF and BM25 are ranking algorithms used to score documents based on their relevance to queries:

  • TF-IDF (Term Frequency-Inverse Document Frequency): A classic weighting scheme that scores a word higher the more often it appears in a document and the fewer documents in the corpus contain it. Used widely in text mining and information retrieval.

  • BM25 (Best Matching 25): An evolution of TF-IDF that adds document length normalization and term frequency saturation, so that long documents and heavily repeated terms do not dominate the score. Generally preferred over plain TF-IDF for search ranking.
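To make the difference between the two schemes concrete, here is a minimal sketch of the textbook formulas on pre-tokenized documents. It is not ExNlp's implementation: the module name RankingSketch, the IDF variants, and the defaults k1 = 1.2 and b = 0.75 are assumptions chosen for illustration, and the sketch assumes a non-empty corpus with at least one non-empty document.

```elixir
defmodule RankingSketch do
  @moduledoc "Textbook TF-IDF and BM25 on token lists (illustrative only, not ExNlp's code)."

  # Number of times `term` occurs in a token list.
  defp count(tokens, term), do: Enum.count(tokens, &(&1 == term))

  # Number of documents in the corpus that contain `term`.
  defp doc_freq(corpus, term), do: Enum.count(corpus, &(term in &1))

  @doc "Textbook TF-IDF: relative term frequency times log(N / df)."
  def tf_idf(term, doc, corpus) do
    df = doc_freq(corpus, term)

    if df == 0 or doc == [] do
      0.0
    else
      tf = count(doc, term) / length(doc)
      tf * :math.log(length(corpus) / df)
    end
  end

  @doc "BM25 score of one document for a list of query terms."
  def bm25(query, doc, corpus, k1 \\ 1.2, b \\ 0.75) do
    n = length(corpus)
    avgdl = Enum.sum(Enum.map(corpus, &length/1)) / n
    # Length normalization: documents longer than average get a larger denominator.
    norm = k1 * (1 - b + b * length(doc) / avgdl)

    query
    |> Enum.map(fn term ->
      df = doc_freq(corpus, term)
      f = count(doc, term)
      # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
      idf = :math.log((n - df + 0.5) / (df + 0.5) + 1)
      # Saturation: f * (k1 + 1) / (f + norm) approaches k1 + 1 as f grows.
      idf * f * (k1 + 1) / (f + norm)
    end)
    |> Enum.sum()
  end
end
```

With the corpus [["the", "quick", "brown", "fox"], ["a", "brown", "dog"]], RankingSketch.tf_idf("fox", ...) on the first document is 0.25 * ln 2, roughly 0.173. ExNlp may use a different TF or IDF variant, so these numbers need not match the library's output.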

Both algorithms share similar preprocessing capabilities:

  • Tokenization (with custom tokenizers)
  • Stemming (multiple languages)
  • Stop word removal
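The three steps above can be sketched as a small pipeline. This is a toy illustration only: the module name PreprocessSketch, the stop list, and the suffix-stripping rule are all hypothetical, and the sketch does not use ExNlp's own tokenizers, stemmers, or option names, which this section does not show.

```elixir
defmodule PreprocessSketch do
  @moduledoc "Toy tokenize -> stop-word filter -> stem pipeline (illustrative only)."

  # A tiny hand-picked stop list; real lists are much longer and language-specific.
  @stop_words MapSet.new(~w(a an and is of the to by))

  @doc "Lowercases the text and splits it on runs of non-letter, non-digit characters."
  def tokenize(text) do
    text
    |> String.downcase()
    |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
  end

  @doc "Drops tokens that appear in the stop list."
  def remove_stop_words(tokens) do
    Enum.reject(tokens, &MapSet.member?(@stop_words, &1))
  end

  @doc "Naive suffix stripping; a stand-in for a real stemmer such as Porter or Snowball."
  def stem(token) do
    cond do
      String.length(token) > 4 and String.ends_with?(token, "ing") ->
        String.replace_suffix(token, "ing", "")

      String.length(token) > 3 and String.ends_with?(token, "s") ->
        String.replace_suffix(token, "s", "")

      true ->
        token
    end
  end

  @doc "Runs the three steps in order."
  def preprocess(text) do
    text |> tokenize() |> remove_stop_words() |> Enum.map(&stem/1)
  end
end
```

For example, PreprocessSketch.preprocess("BM25 is a ranking function") yields ["bm25", "rank", "function"]. Applying the same pipeline to both documents and queries is what lets "ranking" in a query match "rank" in an index.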

Examples

# TF-IDF
iex> documents = ["The quick brown fox", "A brown dog"]
iex> ExNlp.Ranking.TfIdf.calculate("fox", "The quick brown fox", documents)
0.5108256237659907

# BM25
iex> documents = ["BM25 is a ranking function", "used by search engines"]
iex> ExNlp.Ranking.Bm25.score(documents, ["ranking", "search"])
[1.8455076734299591, 1.0126973514850315]

Summary

Functions

bm25(documents, query, opts \\ [])

Convenience function to score documents with BM25 (delegates to Bm25.score/3).

tf_idf(word, document, corpus, opts \\ [])

Convenience function to calculate TF-IDF (delegates to TfIdf.calculate/4).

Functions

bm25(documents, query, opts \\ [])

@spec bm25([String.t()] | [[String.t()]], [String.t()] | String.t(), keyword()) :: [
  float()
]

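Convenience function to score documents with BM25. Delegates to ExNlp.Ranking.Bm25.score/3. Documents may be raw strings or pre-tokenized lists, and the query may be a string or a list of terms. Returns a list of float scores.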
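A common next step is turning the returned scores into a ranked result list. The sketch below assumes the scores come back in the same order as the input documents, which the Examples section suggests but this page does not state outright. RankSketch is a hypothetical helper, and the score values are copied from the Examples section so the snippet runs without ex_nlp installed.

```elixir
defmodule RankSketch do
  @doc "Pairs each document with its score and sorts by score, highest first."
  def rank(documents, scores) do
    documents
    |> Enum.zip(scores)
    |> Enum.sort_by(fn {_doc, score} -> score end, :desc)
  end
end

documents = ["BM25 is a ranking function", "used by search engines"]

# With ex_nlp available, the scores would come from:
#   scores = ExNlp.Ranking.bm25(documents, ["ranking", "search"])
# The literal values below are copied from the Examples section.
scores = [1.8455076734299591, 1.0126973514850315]

ranked = RankSketch.rank(documents, scores)
IO.inspect(ranked, label: "ranked")
```

Keeping the pairing in a separate helper means the same code works with any scorer that returns one score per document.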

tf_idf(word, document, corpus, opts \\ [])

@spec tf_idf(
  String.t(),
  String.t() | [String.t()],
  [String.t()] | [[String.t()]],
  keyword()
) :: float()

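Convenience function to calculate the TF-IDF score of a word in a document relative to a corpus. Delegates to ExNlp.Ranking.TfIdf.calculate/4. The document may be a raw string or a token list, and the corpus a list of either.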
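Because this function scores one word at a time, extracting the most distinctive terms of a document means calling it once per distinct token. The following is a minimal sketch of that loop. KeywordSketch and top_terms are hypothetical names, and the scorer is passed in as a function so the snippet stays self-contained; the stand-in scorer shown here is not TF-IDF.

```elixir
defmodule KeywordSketch do
  @doc """
  Scores every distinct token of `doc` with `scorer.(term, doc, corpus)` and
  returns the `n` highest-scoring `{term, score}` pairs.
  """
  def top_terms(doc, corpus, scorer, n \\ 3) do
    doc
    |> Enum.uniq()
    |> Enum.map(fn term -> {term, scorer.(term, doc, corpus)} end)
    |> Enum.sort_by(fn {_term, score} -> score end, :desc)
    |> Enum.take(n)
  end
end

corpus = [["brown", "fox"], ["brown", "dog"]]

# Stand-in scorer for illustration: rarer terms score higher.
# With ex_nlp available, &ExNlp.Ranking.tf_idf/3 could be passed instead.
rarity = fn term, _doc, docs -> 1 / Enum.count(docs, &(term in &1)) end

IO.inspect(KeywordSketch.top_terms(hd(corpus), corpus, rarity, 1))
```

Passing the scorer as an argument also makes it easy to swap in a different weighting scheme without touching the ranking loop.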