ExNlp.Ranking.TfIdf (ex_nlp v0.1.0)

Term Frequency-Inverse Document Frequency (TF-IDF) implementation.

TF-IDF is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.

This implementation integrates with ExNlp's tokenizers and stemmers for flexible text processing.

Examples

iex> documents = [
...>   "The quick brown fox jumps over the lazy dog",
...>   "A brown dog is running in the park",
...>   "The fox and the dog are friends"
...> ]
iex> ExNlp.Ranking.TfIdf.calculate("fox", "The quick brown fox", documents)
0.07192051811294521

# With stemming
iex> documents = [
...>   "The quick brown fox jumps over the lazy dog",
...>   "A brown dog is running in the park",
...>   "The fox and the dog are friends"
...> ]
iex> ExNlp.Ranking.TfIdf.calculate("jump", "The fox is jumping", documents, stem: true, language: :english)
0.17328679513998632

# Calculate TF-IDF for all words in a document
iex> documents = [
...>   "The quick brown fox jumps over the lazy dog",
...>   "A brown dog is running in the park",
...>   "The fox and the dog are friends"
...> ]
iex> ExNlp.Ranking.TfIdf.calculate_all("The quick brown fox", documents)
[{"quick", 0.17328679513998632}, {"brown", 0.07192051811294521}, {"fox", 0.07192051811294521}, {"the", 0.0}]

Reference: https://en.wikipedia.org/wiki/Tf-idf
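The first example's score can be reproduced by hand. Judging from the output values, calculate/4 appears to count the target document alongside the corpus when computing IDF; this is inferred from the numbers, not stated anywhere, so treat the sketch below as an illustration of the arithmetic rather than the actual implementation:

```elixir
# Hand-computing calculate("fox", "The quick brown fox", documents).
# Assumption (inferred from the scores): the document itself is included
# in the document-frequency count.
doc_tokens = ["the", "quick", "brown", "fox"]

corpus_tokens = [
  ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
  ["a", "brown", "dog", "is", "running", "in", "the", "park"],
  ["the", "fox", "and", "the", "dog", "are", "friends"]
]

all_docs = [doc_tokens | corpus_tokens]
tf = Enum.count(doc_tokens, &(&1 == "fox")) / length(doc_tokens)  # 1/4
df = Enum.count(all_docs, &("fox" in &1))                         # 3 of 4 docs
tfidf = tf * :math.log(length(all_docs) / df)
#=> 0.07192051811294521
```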

Summary

Types

Options for TF-IDF calculation. Can be a keyword list or a TfIdfOptions struct.

Functions

Calculates IDF for multiple terms in a single pass.

Calculates TF-IDF score for a word in a document within a corpus.

Calculates TF-IDF scores for all words in a document.

Calculates TF-IDF score from pre-processed token lists.

Calculates term frequencies for all terms in a document at once.

Returns the inverse document frequency (IDF) of a word in a corpus.

Returns the raw term frequency (count) of a word in a document.

Scores documents against a query using TF-IDF.

Scores a single document against a query using TF-IDF.

Scores a single document from pre-processed tokens against a query.

Scores documents from pre-processed token lists using TF-IDF.

Returns the term frequency (TF) of a word in a document.

Types

options()

@type options() :: keyword() | ExNlp.Ranking.TfIdfOptions.t()

Options for TF-IDF calculation. Can be a keyword list or a TfIdfOptions struct.

Keyword list options:

  • :stem - If true, apply stemming to words (default: false)
  • :language - Language for stemming (default: :english)
  • :tokenizer - Custom tokenizer function (default: uses Tokenizer.word_tokenize/1)
  • :remove_stopwords - If true, remove stop words (default: false)
  • :stopword_language - Language for stop words (default: :english)
  • :smooth_idf - If true, use smoothed IDF (adds 1 to prevent division by zero, default: false)
  • :sublinear_tf - If true, use log scaling for TF: 1 + log(tf) instead of raw tf (default: false)
  • :normalize - Normalization method: :l2, :l1, or nil (default: nil)
  • :tf_variant - Term frequency variant: :normalized or :raw (default: :normalized)
  • :idf_variant - IDF variant: :standard or :bm25 (default: :standard)

Alternatively, you can pass a TfIdfOptions struct created with TfIdfOptions.new/1.

Functions

batch_idf(terms, corpus, opts \\ [])

@spec batch_idf([String.t()], [[String.t()]], options()) :: %{
  required(String.t()) => float()
}

Calculates IDF for multiple terms in a single pass.

More efficient than calling inverse_document_frequency/4 repeatedly when you need IDF values for many terms.

Arguments

  • terms - List of terms to calculate IDF for
  • corpus - List of tokenized documents
  • opts - Options (see options/0)

Examples

iex> terms = ["search", "engine"]
iex> corpus = [["search", "engine"], ["engine", "query"]]
iex> idf_map = ExNlp.Ranking.TfIdf.batch_idf(terms, corpus)
iex> Map.has_key?(idf_map, "search")
true
iex> idf_map["engine"]
0.0
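The 0.0 for "engine" is expected: the term appears in every document of the corpus, so the standard IDF formula ln(N/df) collapses to ln(1). A quick check of the arithmetic (not ExNlp code):

```elixir
n = 2    # documents in the corpus
df = 2   # documents containing "engine"
:math.log(n / df)
#=> 0.0
```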

calculate(word, document, corpus, opts \\ [])

@spec calculate(
  String.t(),
  String.t() | [String.t()],
  [String.t()] | [[String.t()]],
  options()
) ::
  float()

Calculates TF-IDF score for a word in a document within a corpus.

Arguments

  • word - The word to calculate TF-IDF for
  • document - The document text (string) or pre-tokenized list
  • corpus - List of documents (strings or pre-tokenized lists)
  • opts - Options (see options/0)

Examples

iex> documents = ["dog hat", "dog", "cat mat", "duck"]
iex> ExNlp.Ranking.TfIdf.calculate("dog", "nice dog dog", documents)
0.3405504158439938

# With tokenized input
iex> tokenized_doc = ["nice", "dog", "dog"]
iex> tokenized_corpus = [["dog", "hat"], ["dog"], ["cat", "mat"], ["duck"]]
iex> ExNlp.Ranking.TfIdf.calculate("dog", tokenized_doc, tokenized_corpus)
0.3405504158439938

# With pre-processed tokens (faster, skips tokenization/processing)
iex> processed_doc = ["nice", "dog", "dog"]
iex> processed_corpus = [["dog", "hat"], ["dog"], ["cat", "mat"]]
iex> ExNlp.Ranking.TfIdf.calculate_from_tokens("dog", processed_doc, processed_corpus)
0.19178804830118723

calculate_all(document, corpus, opts \\ [])

@spec calculate_all(
  String.t() | [String.t()],
  [String.t()] | [[String.t()]],
  options()
) :: [
  {String.t(), float()}
]

Calculates TF-IDF scores for all words in a document.

Returns a list of {word, score} tuples sorted by score (descending).

Note: This function is slower than calculate_from_tokens/4 because it performs tokenization and processing. For better performance, use pre-processed tokens with calculate_from_tokens/4 or process tokens once and reuse them.

Examples

iex> documents = ["dog hat", "dog", "cat mat", "duck"]
iex> ExNlp.Ranking.TfIdf.calculate_all("nice dog", documents)
[{"nice", 0.8047189562170501}, {"dog", 0.25541281188299536}]

# Custom tokenizer
iex> tokenizer = fn text -> String.split(text, ",") |> Enum.map(&String.trim/1) end
iex> ExNlp.Ranking.TfIdf.calculate_all("nice,dog", ["dog,hat", "dog", "cat,mat"], tokenizer: tokenizer)
[{"nice", 0.6931471805599453}, {"dog", 0.14384103622589042}]

calculate_from_tokens(word, document, corpus, opts \\ [])

@spec calculate_from_tokens(String.t(), [String.t()], [[String.t()]], options()) ::
  float()

Calculates TF-IDF score from pre-processed token lists.

This is more efficient when you already have tokenized and processed documents, as it skips the tokenization and processing steps.

Arguments

  • word - The word to calculate TF-IDF for (should already be processed)
  • document - Pre-processed token list
  • corpus - List of pre-processed token lists
  • opts - Options (see options/0)

Examples

iex> processed_doc = ["nice", "dog", "dog"]
iex> processed_corpus = [["dog", "hat"], ["dog"], ["cat", "mat"]]
iex> ExNlp.Ranking.TfIdf.calculate_from_tokens("dog", processed_doc, processed_corpus)
0.19178804830118723

# With BM25 variant
iex> processed_doc2 = ["nice", "dog", "dog"]
iex> processed_corpus2 = [["dog", "hat"], ["dog"], ["cat", "mat"]]
iex> ExNlp.Ranking.TfIdf.calculate_from_tokens("dog", processed_doc2, processed_corpus2,
...>   tf_variant: :raw, idf_variant: :bm25
...> )
0.7133498878774648
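Both results can be checked by hand. As in calculate/4, the numbers suggest the target document is counted together with the corpus for document frequency; that detail is inferred from the outputs, so the sketch below is illustrative only:

```elixir
doc = ["nice", "dog", "dog"]
corpus = [["dog", "hat"], ["dog"], ["cat", "mat"]]
all_docs = [doc | corpus]

n = length(all_docs)                        # 4
df = Enum.count(all_docs, &("dog" in &1))   # 3
raw_tf = Enum.count(doc, &(&1 == "dog"))    # 2

# defaults: normalized TF x standard IDF
default_score = raw_tf / length(doc) * :math.log(n / df)
#=> 0.19178804830118723

# tf_variant: :raw with idf_variant: :bm25
bm25_score = raw_tf * :math.log(1 + (n - df + 0.5) / (df + 0.5))
#=> 0.7133498878774648
```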

document_term_frequencies(doc_tokens, variant \\ :normalized, sublinear \\ false)

@spec document_term_frequencies([String.t()], :normalized | :raw, boolean()) :: %{
  required(String.t()) => float() | non_neg_integer()
}

Calculates term frequencies for all terms in a document at once.

Returns a map of term => tf. More efficient than calling term_frequency/4 repeatedly for each term in a document.

Arguments

  • doc_tokens - List of tokens in the document
  • variant - :normalized (default) for normalized TF, :raw for raw count
  • sublinear - If true, apply log scaling (only for normalized variant)

Examples

iex> tokens = ["search", "engine", "search"]
iex> freqs = ExNlp.Ranking.TfIdf.document_term_frequencies(tokens)
iex> freqs["search"]
0.6666666666666666
iex> freqs["engine"]
0.3333333333333333

# Raw frequencies
iex> tokens = ["search", "engine", "search"]
iex> freqs = ExNlp.Ranking.TfIdf.document_term_frequencies(tokens, :raw)
iex> freqs["search"]
2
iex> freqs["engine"]
1
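Conceptually this is a single frequency pass over the tokens followed by normalization. A self-contained sketch using only the standard library (not ExNlp internals, which may differ):

```elixir
tokens = ["search", "engine", "search"]
total = length(tokens)

freqs =
  tokens
  |> Enum.frequencies()                                       # %{"engine" => 1, "search" => 2}
  |> Map.new(fn {term, count} -> {term, count / total} end)

freqs["search"]
#=> 0.6666666666666666
```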

inverse_document_frequency(word, corpus, variant \\ :standard, smooth \\ false)

@spec inverse_document_frequency(
  String.t(),
  [[String.t()]],
  :standard | :bm25,
  boolean()
) :: float()

Returns the inverse document frequency (IDF) of a word in a corpus.

IDF measures how rare or common a word is across the entire corpus. Common words have lower IDF, rare words have higher IDF.

Arguments

  • word - The word to calculate IDF for
  • corpus - List of documents (each document is a list of tokens)
  • variant - :standard (default) for standard TF-IDF IDF, :bm25 for BM25 variant
  • smooth - If true, use smoothed IDF (only applies to :standard variant)

Examples

# Standard IDF
iex> corpus = [["dog", "hat"], ["dog"], ["cat", "mat"], ["duck"]]
iex> ExNlp.Ranking.TfIdf.inverse_document_frequency("dog", corpus)
0.6931471805599453

# BM25 variant IDF
iex> corpus = [["dog", "hat"], ["dog"], ["cat", "mat"]]
iex> ExNlp.Ranking.TfIdf.inverse_document_frequency("dog", corpus, :bm25)
0.47000362924573563

# Smoothed standard IDF
iex> corpus = [["dog", "hat"], ["dog"], ["cat", "mat"]]
iex> ExNlp.Ranking.TfIdf.inverse_document_frequency("dog", corpus, :standard, true)
1.2876820724517808

iex> corpus = [["dog", "hat"], ["dog"], ["cat", "mat"]]
iex> ExNlp.Ranking.TfIdf.inverse_document_frequency("unique", corpus)
1.791759469228055
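The example values are consistent with the following closed forms, inferred from the outputs above (they may differ from the implementation in edge cases such as terms absent from the corpus): standard IDF is ln(N/df), smoothed IDF is ln((N + 1)/(df + 1)) + 1, and the BM25-style IDF is ln(1 + (N - df + 0.5)/(df + 0.5)). Checking them against the examples:

```elixir
# standard: 4 documents, "dog" in 2 of them
standard = :math.log(4 / 2)
#=> 0.6931471805599453

# smoothed: 3 documents, "dog" in 2 of them
smoothed = :math.log((3 + 1) / (2 + 1)) + 1
#=> 1.2876820724517808

# BM25-style: 3 documents, "dog" in 2 of them
bm25 = :math.log(1 + (3 - 2 + 0.5) / (2 + 0.5))
#=> 0.47000362924573563
```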

raw_term_frequency(word, document)

@spec raw_term_frequency(String.t(), [String.t()]) :: non_neg_integer()

Returns the raw term frequency (count) of a word in a document.

This is a convenience function; for consistency, prefer term_frequency/4 with the :raw variant.

Examples

iex> ExNlp.Ranking.TfIdf.raw_term_frequency("dog", ["nice", "dog", "dog"])
2

iex> ExNlp.Ranking.TfIdf.raw_term_frequency("cat", ["nice", "dog", "dog"])
0

score(documents, query, opts \\ [])

@spec score([String.t()] | [[String.t()]], [String.t()] | String.t(), options()) :: [
  float()
]

Scores documents against a query using TF-IDF.

Returns a list of scores, one for each document. Higher scores indicate greater relevance. The score is the sum of TF-IDF scores for all query terms.

Arguments

  • documents - List of document texts (strings) or pre-tokenized lists
  • query - Query as a list of keywords or a single string
  • opts - Options (see options/0)

Examples

iex> documents = [
...>   "The quick brown fox jumps over the lazy dog",
...>   "A brown dog is running in the park",
...>   "The fox and the dog are friends"
...> ]
iex> query = ["brown", "fox"]
iex> ExNlp.Ranking.TfIdf.score(documents, query)
[0.06392934943372908, 0.035960259056472606, 0.04109743892168297]

# Query as a single string
iex> documents = ["The quick brown fox", "A brown dog", "The fox and dog"]
iex> ExNlp.Ranking.TfIdf.score(documents, "brown fox")
[0.14384103622589042, 0.09589402415059362, 0.07192051811294521]
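The first score above decomposes into the per-term TF-IDF contributions for "brown" and "fox". A hand computation (again assuming, based on the numbers, that the scored document is counted with the corpus for document frequency):

```elixir
doc = ~w(the quick brown fox jumps over the lazy dog)

corpus = [
  doc,
  ~w(a brown dog is running in the park),
  ~w(the fox and the dog are friends)
]

score =
  Enum.reduce(["brown", "fox"], 0.0, fn term, acc ->
    all = [doc | corpus]
    tf = Enum.count(doc, &(&1 == term)) / length(doc)
    df = Enum.count(all, &(term in &1))
    acc + tf * :math.log(length(all) / df)
  end)
# matches the first score above (~0.0639)
```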

score_document(document, query, corpus, opts \\ [])

@spec score_document(
  String.t() | [String.t()],
  [String.t()] | String.t(),
  [String.t()] | [[String.t()]],
  options()
) :: float()

Scores a single document against a query using TF-IDF.

Examples

iex> documents = ["The quick brown fox", "A brown dog"]
iex> query = ["brown", "fox"]
iex> ExNlp.Ranking.TfIdf.score_document("The quick brown fox", query, documents)
0.07192051811294521

# With options
iex> documents = ["The quick brown fox", "A brown dog"]
iex> query = ["brown", "fox"]
iex> ExNlp.Ranking.TfIdf.score_document("The quick brown fox", query, documents,
...>   tf_variant: :raw, idf_variant: :bm25
...> )
0.46203545959655873

score_document_from_tokens(document, query, corpus, opts \\ [])

@spec score_document_from_tokens(
  [String.t()],
  [String.t()],
  [[String.t()]],
  options()
) :: float()

Scores a single document from pre-processed tokens against a query.

Arguments

  • document - Pre-processed token list
  • query - Pre-processed token list
  • corpus - List of pre-processed token lists (for IDF calculation)
  • opts - Options (TF-IDF-specific options apply; processing options are ignored)

Examples

iex> processed_doc = ["brown", "fox"]
iex> processed_query = ["brown", "fox"]
iex> processed_corpus = [["brown", "dog"], ["fox", "dog"]]
iex> ExNlp.Ranking.TfIdf.score_document_from_tokens(processed_doc, processed_query, processed_corpus)
0.28768207245178085

score_from_tokens(documents, query, opts \\ [])

@spec score_from_tokens([[String.t()]], [String.t()], options()) :: [float()]

Scores documents from pre-processed token lists using TF-IDF.

This is more efficient when you already have tokenized and processed documents, as it skips the tokenization and processing steps.

Arguments

  • documents - List of pre-processed token lists
  • query - Pre-processed token list
  • opts - Options (TF-IDF specific options are used, processing options ignored)

Examples

iex> processed_docs = [["brown", "fox"], ["brown", "dog"], ["fox", "dog"]]
iex> processed_query = ["brown", "fox"]
iex> ExNlp.Ranking.TfIdf.score_from_tokens(processed_docs, processed_query)
[0.28768207245178085, 0.14384103622589042, 0.14384103622589042]

term_frequency(word, document, variant \\ :normalized, sublinear \\ false)

@spec term_frequency(String.t(), [String.t()], :normalized | :raw, boolean()) ::
  float() | non_neg_integer()

Returns the term frequency (TF) of a word in a document.

Arguments

  • word - The word to calculate TF for
  • document - List of tokens representing the document
  • variant - :normalized (default) for normalized TF, :raw for raw count
  • sublinear - If true, apply log scaling: 1 + log(tf) (only for normalized)

Examples

# Normalized TF (default)
iex> ExNlp.Ranking.TfIdf.term_frequency("dog", ["nice", "dog", "dog"])
0.6666666666666666

# Raw TF count
iex> ExNlp.Ranking.TfIdf.term_frequency("dog", ["nice", "dog", "dog"], :raw)
2

# Sublinear normalized TF
iex> ExNlp.Ranking.TfIdf.term_frequency("dog", ["nice", "dog", "dog"], :normalized, true)
0.5945348918918356

iex> ExNlp.Ranking.TfIdf.term_frequency("cat", ["nice", "dog", "dog"])
0.0
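The variants reduce to simple closed forms (inferred from the example values): raw TF is the count, normalized TF divides the count by the document length, and the sublinear option applies 1 + ln(tf) to the normalized value. Checking:

```elixir
tokens = ["nice", "dog", "dog"]

raw = Enum.count(tokens, &(&1 == "dog"))   # 2, the :raw variant
normalized = raw / length(tokens)
#=> 0.6666666666666666
sublinear = 1 + :math.log(normalized)
#=> 0.5945348918918356
```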