ExNlp.Ranking.Base (ex_nlp v0.1.0)

Base module for ExNlp ranking algorithms, providing shared utilities.

This module contains common functions used by both TF-IDF and BM25 implementations, such as token processing, stemming, stopword removal, and normalization.

Vector Interface

All ranking algorithms in ExNlp.Ranking use a unified vector representation:

  • Token Vector: [String.t()] - A list of processed token strings representing a document
  • Document Collection: [[String.t()]] - A list of token vectors representing multiple documents

Both TF-IDF and BM25 provide two interfaces:

  1. Standard interface: Accepts strings or token lists, handles tokenization and processing
  2. Token interface: Accepts pre-processed token vectors (faster, skips tokenization)

Examples

# Standard interface (handles tokenization)
ExNlp.Ranking.TfIdf.calculate("word", "document text", ["doc1", "doc2"])
ExNlp.Ranking.Bm25.score(["doc1", "doc2"], "query text")

# Token interface (uses pre-processed vectors)
ExNlp.Ranking.TfIdf.calculate_from_tokens("word", ["doc", "text"], [["doc1"], ["doc2"]])
ExNlp.Ranking.Bm25.score_from_tokens([["doc1"], ["doc2"]], ["query", "text"])
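
# Typical workflow: pre-process each document once, then reuse the token
# vectors across queries. This is a sketch using the tokenizer and options
# shown in the function documentation below; exact tokens depend on the
# configured stemmer and stopword list.
tokenizer = &ExNlp.Tokenizer.word_tokenize/1
opts = ExNlp.Ranking.ProcessingOptions.new(stem: true, language: :english)

# Tokenize, stem, and filter each raw document once...
docs = ["the cat was running", "dogs run fast"]
token_vectors = Enum.map(docs, &ExNlp.Ranking.Base.tokenize_and_process(&1, tokenizer, opts))

# ...then score queries against the same vectors without re-tokenizing.
query = ExNlp.Ranking.Base.tokenize_and_process("running cat", tokenizer, opts)
ExNlp.Ranking.Bm25.score_from_tokens(token_vectors, query)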

Summary

Types

token_vector() - A token vector representing a single document.

token_vector_collection() - A collection of token vectors representing multiple documents.

Functions

maybe_remove_stopwords(tokens, bool, language) - Optionally removes stopwords from a list of tokens.

maybe_stem(tokens, bool, language) - Optionally stems a single word or list of tokens.

normalize(scores, atom) - Normalizes scores based on the specified method.

normalize_l1(scores) - Normalizes a list of scores using the L1 (Manhattan) norm.

normalize_l2(scores) - Normalizes a list of scores using the L2 (Euclidean) norm.

process_tokens(tokens, opts) - Processes tokens by optionally removing stopwords and applying stemming.

tokenize_and_process(doc, tokenizer, opts) - Tokenizes and processes a document or token list.

Types

token_vector()

@type token_vector() :: [String.t()]

A token vector representing a single document.

This is a list of processed token strings. Tokens should already be stemmed and filtered (stopwords removed) if those options were used.

token_vector_collection()

@type token_vector_collection() :: [[String.t()]]

A collection of token vectors representing multiple documents.

Functions

maybe_remove_stopwords(tokens, bool, language)

@spec maybe_remove_stopwords([String.t()], boolean(), atom()) :: [String.t()]

Optionally removes stopwords from a list of tokens.

Examples

iex> ExNlp.Ranking.Base.maybe_remove_stopwords(["the", "quick", "brown"], false, :english)
["the", "quick", "brown"]

iex> ExNlp.Ranking.Base.maybe_remove_stopwords(["the", "quick", "brown"], true, :english)
["quick", "brown"]

maybe_stem(tokens, bool, language)

@spec maybe_stem(String.t() | [String.t()], boolean(), atom()) ::
  String.t() | [String.t()]

Optionally stems a single word or list of tokens.

Examples

iex> ExNlp.Ranking.Base.maybe_stem("running", true, :english)
"run"

iex> ExNlp.Ranking.Base.maybe_stem(["running", "jumping"], true, :english)
["run", "jump"]

iex> ExNlp.Ranking.Base.maybe_stem("running", false, :english)
"running"

normalize(scores, atom)

@spec normalize([float()], :l1 | :l2 | nil) :: [float()]

Normalizes scores based on the specified method.

Arguments

  • scores - List of numeric scores to normalize
  • method - Normalization method: :l1, :l2, or nil (no normalization)

Examples

iex> ExNlp.Ranking.Base.normalize([3.0, 4.0], :l2)
[0.6, 0.8]

iex> ExNlp.Ranking.Base.normalize([1.0, 2.0, 3.0], :l1)
[0.16666666666666666, 0.3333333333333333, 0.5]

iex> ExNlp.Ranking.Base.normalize([1.0, 2.0, 3.0], nil)
[1.0, 2.0, 3.0]
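
# Normalization is typically applied to the raw scores a ranking algorithm
# returns. A sketch, assuming Bm25.score/2 yields one float per document
# (as the interface above suggests):
scores = ExNlp.Ranking.Bm25.score(["the cat runs", "dogs run fast"], "running cat")
# Scale to unit length so scores are comparable across queries.
ExNlp.Ranking.Base.normalize(scores, :l2)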

normalize_l1(scores)

@spec normalize_l1([float()]) :: [float()]

Normalizes a list of scores using the L1 (Manhattan) norm.

L1 normalization divides each score by the sum of the absolute values of all scores, so the normalized scores sum to 1.0 (a probability distribution). If every score is zero, the list is returned unchanged to avoid division by zero.

Examples

iex> ExNlp.Ranking.Base.normalize_l1([1.0, 2.0, 3.0])
[0.16666666666666666, 0.3333333333333333, 0.5]

iex> ExNlp.Ranking.Base.normalize_l1([2.0, 2.0])
[0.5, 0.5]

iex> ExNlp.Ranking.Base.normalize_l1([0.0, 0.0])
[0.0, 0.0]

normalize_l2(scores)

@spec normalize_l2([float()]) :: [float()]

Normalizes a list of scores using the L2 (Euclidean) norm.

L2 normalization divides each score by the square root of the sum of squared scores, producing a unit vector. If every score is zero, the list is returned unchanged to avoid division by zero.

Examples

iex> ExNlp.Ranking.Base.normalize_l2([3.0, 4.0])
[0.6, 0.8]

iex> ExNlp.Ranking.Base.normalize_l2([1.0, 2.0, 3.0])
[0.2672612419124244, 0.5345224838248488, 0.8017837257372732]

iex> ExNlp.Ranking.Base.normalize_l2([0.0, 0.0])
[0.0, 0.0]

normalize_word(word)

normalize_words(words, language)

process_tokens(tokens, opts)

@spec process_tokens([String.t()], ExNlp.Ranking.ProcessingOptions.t()) :: [
  String.t()
]

Processes tokens by optionally removing stopwords and applying stemming.

This function applies preprocessing transformations in order:

  1. Stopword removal (if enabled)
  2. Stemming (if enabled)

Examples

iex> opts = ExNlp.Ranking.ProcessingOptions.new(stem: false, language: :english, remove_stopwords: false)
iex> ExNlp.Ranking.Base.process_tokens(["the", "quick", "brown", "fox"], opts)
["the", "quick", "brown", "fox"]

iex> opts = ExNlp.Ranking.ProcessingOptions.new(stem: false, language: :english, remove_stopwords: true)
iex> ExNlp.Ranking.Base.process_tokens(["the", "quick", "brown", "fox"], opts)
["quick", "brown", "fox"]

iex> opts = ExNlp.Ranking.ProcessingOptions.new(stem: true, language: :english, remove_stopwords: false)
iex> ExNlp.Ranking.Base.process_tokens(["running", "jumping"], opts)
["run", "jump"]

tokenize_and_process(doc, tokenizer, opts)

@spec tokenize_and_process(
  String.t() | [String.t()],
  (String.t() -> [String.t()]),
  ExNlp.Ranking.ProcessingOptions.t()
) :: [String.t()]

Tokenizes and processes a document or token list.

This is a convenience function that handles both string documents and pre-tokenized lists, applying tokenization and processing as needed.

Arguments

  • doc - Document as a string or pre-tokenized list
  • tokenizer - Tokenizer function
  • opts - Processing options

Examples

iex> tokenizer = &ExNlp.Tokenizer.word_tokenize/1
iex> opts = ExNlp.Ranking.ProcessingOptions.new(stem: true, language: :english)
iex> ExNlp.Ranking.Base.tokenize_and_process("running jumping", tokenizer, opts)
["run", "jump"]

iex> tokenizer = &ExNlp.Tokenizer.word_tokenize/1
iex> opts = ExNlp.Ranking.ProcessingOptions.new(stem: false)
iex> ExNlp.Ranking.Base.tokenize_and_process(["running", "jumping"], tokenizer, opts)
["running", "jumping"]