ExNlp.Ranking.Base (ex_nlp v0.1.0)
Base module for ExNlp ranking algorithms, providing shared utilities.
This module contains common functions used by both TF-IDF and BM25 implementations, such as token processing, stemming, stopword removal, and normalization.
Vector Interface
All ranking algorithms in ExNlp.Ranking use a unified vector representation:
- Token Vector: [String.t()] - A list of processed token strings representing a single document
- Document Collection: [[String.t()]] - A list of token vectors representing multiple documents
Both TF-IDF and BM25 provide two interfaces:
- Standard interface: Accepts strings or token lists, handles tokenization and processing
- Token interface: Accepts pre-processed token vectors (faster, skips tokenization)
Examples
# Standard interface (handles tokenization)
ExNlp.Ranking.TfIdf.calculate("word", "document text", ["doc1", "doc2"])
ExNlp.Ranking.Bm25.score(["doc1", "doc2"], "query text")
# Token interface (uses pre-processed vectors)
ExNlp.Ranking.TfIdf.calculate_from_tokens("word", ["doc", "text"], [["doc1"], ["doc2"]])
ExNlp.Ranking.Bm25.score_from_tokens([["doc1"], ["doc2"]], ["query", "text"])
Summary
Types
A token vector representing a single document.
A collection of token vectors representing multiple documents.
Functions
Optionally removes stopwords from a list of tokens.
Optionally stems a single word or list of tokens.
Normalizes scores based on the specified method.
Normalizes a list of scores using L1 (Manhattan) norm.
Normalizes a list of scores using L2 (Euclidean) norm.
Processes tokens by optionally removing stopwords and applying stemming.
Tokenizes and processes a document or token list.
Types
@type token_vector() :: [String.t()]
A token vector representing a single document.
This is a list of processed token strings. Tokens should already be stemmed and filtered (stopwords removed) if those options were used.
@type token_vector_collection() :: [[String.t()]]
A collection of token vectors representing multiple documents.
Functions
Optionally removes stopwords from a list of tokens.
Examples
iex> ExNlp.Ranking.Base.maybe_remove_stopwords(["the", "quick", "brown"], false, :english)
["the", "quick", "brown"]
iex> ExNlp.Ranking.Base.maybe_remove_stopwords(["the", "quick", "brown"], true, :english)
["quick", "brown"]
Optionally stems a single word or list of tokens.
Examples
iex> ExNlp.Ranking.Base.maybe_stem("running", true, :english)
"run"
iex> ExNlp.Ranking.Base.maybe_stem(["running", "jumping"], true, :english)
["run", "jump"]
iex> ExNlp.Ranking.Base.maybe_stem("running", false, :english)
"running"
Normalizes scores based on the specified method.
Arguments
- scores - List of numeric scores to normalize
- method - Normalization method: :l1, :l2, or nil (no normalization)
Examples
iex> ExNlp.Ranking.Base.normalize([3.0, 4.0], :l2)
[0.6, 0.8]
iex> ExNlp.Ranking.Base.normalize([1.0, 2.0, 3.0], :l1)
[0.16666666666666666, 0.3333333333333333, 0.5]
iex> ExNlp.Ranking.Base.normalize([1.0, 2.0, 3.0], nil)
[1.0, 2.0, 3.0]
Normalizes a list of scores using L1 (Manhattan) norm.
L1 normalization divides each score by the sum of all scores, resulting in scores that sum to 1.0 (probability distribution).
Examples
iex> ExNlp.Ranking.Base.normalize_l1([1.0, 2.0, 3.0])
[0.16666666666666666, 0.3333333333333333, 0.5]
iex> ExNlp.Ranking.Base.normalize_l1([2.0, 2.0])
[0.5, 0.5]
iex> ExNlp.Ranking.Base.normalize_l1([0.0, 0.0])
[0.0, 0.0]
Normalizes a list of scores using L2 (Euclidean) norm.
L2 normalization divides each score by the square root of the sum of squares, resulting in a unit vector.
Examples
iex> ExNlp.Ranking.Base.normalize_l2([3.0, 4.0])
[0.6, 0.8]
iex> ExNlp.Ranking.Base.normalize_l2([1.0, 2.0, 3.0])
[0.2672612419124244, 0.5345224838248488, 0.8017837257372732]
iex> ExNlp.Ranking.Base.normalize_l2([0.0, 0.0])
[0.0, 0.0]
@spec process_tokens([String.t()], ExNlp.Ranking.ProcessingOptions.t()) :: [String.t()]
Processes tokens by optionally removing stopwords and applying stemming.
This function applies preprocessing transformations in order:
- Stopword removal (if enabled)
- Stemming (if enabled)
Examples
iex> opts = ExNlp.Ranking.ProcessingOptions.new(stem: false, language: :english, remove_stopwords: false)
iex> ExNlp.Ranking.Base.process_tokens(["the", "quick", "brown", "fox"], opts)
["the", "quick", "brown", "fox"]
iex> opts = ExNlp.Ranking.ProcessingOptions.new(stem: true, language: :english, remove_stopwords: false)
iex> ExNlp.Ranking.Base.process_tokens(["running", "jumping"], opts)
["run", "jump"]
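The two-step pipeline (stopwords first, then stemming) can be sketched in plain Elixir. Note that the stopword set and the lookup-table "stemmer" below are toy stand-ins for the library's real language-aware implementations:

```elixir
defmodule ProcessSketch do
  # Tiny stand-in stopword set; the library uses per-language lists.
  @stopwords MapSet.new(["the", "a", "an", "of"])

  # Toy lookup table standing in for a real stemmer.
  @stems %{"running" => "run", "jumping" => "jump"}

  # Order matters: remove stopwords first, then stem the survivors.
  def process_tokens(tokens, opts) do
    tokens
    |> maybe_remove_stopwords(opts[:remove_stopwords])
    |> maybe_stem(opts[:stem])
  end

  defp maybe_remove_stopwords(tokens, true),
    do: Enum.reject(tokens, &MapSet.member?(@stopwords, &1))

  defp maybe_remove_stopwords(tokens, _), do: tokens

  defp maybe_stem(tokens, true),
    do: Enum.map(tokens, &Map.get(@stems, &1, &1))

  defp maybe_stem(tokens, _), do: tokens
end

ProcessSketch.process_tokens(["the", "running", "fox"],
  remove_stopwords: true, stem: true)
# => ["run", "fox"]
```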
@spec tokenize_and_process(String.t() | [String.t()], (String.t() -> [String.t()]), ExNlp.Ranking.ProcessingOptions.t()) :: [String.t()]
Tokenizes and processes a document or token list.
This is a convenience function that handles both string documents and pre-tokenized lists, applying tokenization and processing as needed.
Arguments
- doc - Document as a string or pre-tokenized list
- tokenizer - Tokenizer function
- opts - Processing options
Examples
iex> tokenizer = &ExNlp.Tokenizer.word_tokenize/1
iex> opts = ExNlp.Ranking.ProcessingOptions.new(stem: true, language: :english)
iex> ExNlp.Ranking.Base.tokenize_and_process("running jumping", tokenizer, opts)
["run", "jump"]
iex> tokenizer = &ExNlp.Tokenizer.word_tokenize/1
iex> opts = ExNlp.Ranking.ProcessingOptions.new(stem: false)
iex> ExNlp.Ranking.Base.tokenize_and_process(["running", "jumping"], tokenizer, opts)
["running", "jumping"]
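The string-or-list convenience behavior can be sketched as follows. This is a standalone sketch: process/2 here is a pass-through placeholder for the real token processing step, and the whitespace tokenizer stands in for ExNlp.Tokenizer.word_tokenize/1:

```elixir
defmodule TokenizeSketch do
  # Strings are tokenized first, then processed.
  def tokenize_and_process(doc, tokenizer, opts) when is_binary(doc) do
    doc |> tokenizer.() |> process(opts)
  end

  # Pre-tokenized lists skip tokenization and go straight to processing.
  def tokenize_and_process(tokens, _tokenizer, opts) when is_list(tokens) do
    process(tokens, opts)
  end

  # Placeholder for stopword removal + stemming (see process_tokens/2 above).
  defp process(tokens, _opts), do: tokens
end

# Simple whitespace tokenizer standing in for ExNlp.Tokenizer.word_tokenize/1:
tokenizer = &String.split(&1, ~r/\s+/, trim: true)
TokenizeSketch.tokenize_and_process("running jumping", tokenizer, [])
# => ["running", "jumping"]
```

Pattern matching on is_binary/1 vs is_list/1 is what lets callers pass either a raw document string or an already-tokenized vector through the same entry point.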