ExNlp.Ranking.TfIdf (ex_nlp v0.1.0)
Term Frequency-Inverse Document Frequency (TF-IDF) implementation.
TF-IDF is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.
This implementation integrates with ExNlp's tokenizers and stemmers for flexible text processing.
Examples
iex> documents = [
...> "The quick brown fox jumps over the lazy dog",
...> "A brown dog is running in the park",
...> "The fox and the dog are friends"
...> ]
iex> ExNlp.Ranking.TfIdf.calculate("fox", "The quick brown fox", documents)
0.07192051811294521
# With stemming
iex> documents = [
...> "The quick brown fox jumps over the lazy dog",
...> "A brown dog is running in the park",
...> "The fox and the dog are friends"
...> ]
...> ExNlp.Ranking.TfIdf.calculate("jump", "The fox is jumping", documents, stem: true, language: :english)
0.17328679513998632
# Calculate TF-IDF for all words in a document
iex> documents = [
...> "The quick brown fox jumps over the lazy dog",
...> "A brown dog is running in the park",
...> "The fox and the dog are friends"
...> ]
...> ExNlp.Ranking.TfIdf.calculate_all("The quick brown fox", documents)
[{"quick", 0.17328679513998632}, {"brown", 0.07192051811294521}, {"fox", 0.07192051811294521}, {"the", 0.0}]
Reference: https://en.wikipedia.org/wiki/Tf-idf
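The figures in the examples above are consistent with the textbook product tf × idf, where tf is the word's share of the document's tokens and idf is ln(N / df), provided the scored document is counted as one more document in the corpus. The sketch below does not use ExNlp. The module name TfIdfSketch, the lowercase whitespace tokenization, and the treatment of unseen words are assumptions made for illustration, not a description of the library's internals.

```elixir
defmodule TfIdfSketch do
  # Assumption: lowercase and split on whitespace. The library's default
  # tokenizer may handle punctuation differently.
  def tokenize(text), do: text |> String.downcase() |> String.split()

  # Normalized term frequency: occurrences divided by document length.
  def tf(_word, []), do: 0.0
  def tf(word, tokens), do: Enum.count(tokens, &(&1 == word)) / length(tokens)

  # Plain IDF: ln(N / df). Returning 0.0 for a word that occurs nowhere is
  # a choice made for this sketch only.
  def idf(word, corpus) do
    df = Enum.count(corpus, &(word in &1))
    if df == 0, do: 0.0, else: :math.log(length(corpus) / df)
  end

  # Assumption: the scored document is appended to the corpus before the
  # IDF is computed.
  def tf_idf(word, document, documents) do
    doc = tokenize(document)
    corpus = Enum.map(documents, &tokenize/1) ++ [doc]
    tf(word, doc) * idf(word, corpus)
  end
end

documents = [
  "The quick brown fox jumps over the lazy dog",
  "A brown dog is running in the park",
  "The fox and the dog are friends"
]

# tf = 1/4, idf = ln(4/3), so the product is roughly 0.0719
IO.inspect(TfIdfSketch.tf_idf("fox", "The quick brown fox", documents))
```

Under these assumptions "fox" appears in three of four documents, which gives the same figure as the first example.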
Summary
Functions
batch_idf/3 - Batch calculate IDF for multiple terms.
calculate/4 - Calculates TF-IDF score for a word in a document within a corpus.
calculate_all/3 - Calculates TF-IDF scores for all words in a document.
calculate_from_tokens/4 - Calculates TF-IDF score from pre-processed token lists.
document_term_frequencies/3 - Calculate term frequencies for all terms in a document at once.
inverse_document_frequency/4 - Returns the inverse document frequency (IDF) of a word in a corpus.
raw_term_frequency/2 - Returns the raw term frequency (count) of a word in a document.
score/3 - Scores documents against a query using TF-IDF.
score_document/4 - Scores a single document against a query using TF-IDF.
score_document_from_tokens/4 - Scores a single document from pre-processed tokens against a query.
score_from_tokens/3 - Scores documents from pre-processed token lists using TF-IDF.
term_frequency/4 - Returns the term frequency (TF) of a word in a document.
Types
@type options() :: keyword() | ExNlp.Ranking.TfIdfOptions.t()
Options for TF-IDF calculation. Can be a keyword list or a TfIdfOptions struct.
Keyword list options:
:stem - If true, apply stemming to words (default: false)
:language - Language for stemming (default: :english)
:tokenizer - Custom tokenizer function (default: uses Tokenizer.word_tokenize/1)
:remove_stopwords - If true, remove stop words (default: false)
:stopword_language - Language for stop words (default: :english)
:smooth_idf - If true, use smoothed IDF (adds 1 to prevent division by zero, default: false)
:sublinear_tf - If true, use log scaling for TF: 1 + log(tf) instead of raw tf (default: false)
:normalize - Normalization method: :l2, :l1, or nil (default: nil)
:tf_variant - Term frequency variant: :normalized or :raw (default: :normalized)
:idf_variant - IDF variant: :standard or :bm25 (default: :standard)
Alternatively, you can pass a TfIdfOptions struct created with TfIdfOptions.new/1.
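The :normalize option names two standard vector norms. As a rough illustration of what L1 and L2 normalization mean for a list of scores, here is a standalone sketch. NormalizeSketch is a hypothetical module, and this page does not say at which step or over which values the library applies the normalization, so treat it as background rather than as the library's behaviour.

```elixir
defmodule NormalizeSketch do
  # L1: divide each score by the sum of absolute values.
  def l1(scores), do: scale(scores, scores |> Enum.map(&abs/1) |> Enum.sum())

  # L2: divide each score by the Euclidean length of the vector.
  def l2(scores) do
    norm = scores |> Enum.map(&(&1 * &1)) |> Enum.sum() |> :math.sqrt()
    scale(scores, norm)
  end

  # An all-zero (or empty) vector is returned unchanged to avoid dividing
  # by zero.
  defp scale(scores, norm) when norm == 0, do: scores
  defp scale(scores, norm), do: Enum.map(scores, &(&1 / norm))
end

IO.inspect(NormalizeSketch.l1([3.0, 1.0]))  # [0.75, 0.25]
IO.inspect(NormalizeSketch.l2([3.0, 4.0]))  # [0.6, 0.8]
```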
Functions
batch_idf/3
Batch calculate IDF for multiple terms.
More efficient than calling inverse_document_frequency/4 repeatedly
when you need IDF values for many terms.
Arguments
terms - List of terms to calculate IDF for
corpus - List of tokenized documents
opts - Options (see options/0)
Examples
iex> terms = ["search", "engine"]
iex> corpus = [["search", "engine"], ["engine", "query"]]
iex> idf_map = ExNlp.Ranking.TfIdf.batch_idf(terms, corpus)
iex> Map.has_key?(idf_map, "search")
true
iex> idf_map["engine"]
0.0
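One way a batch computation can save work is to count document frequencies for the whole corpus in a single pass and then look each term up. The sketch below shows that idea on its own. BatchIdfSketch is a hypothetical module, the IDF formula ln(N / df) is an assumption, and the value returned for terms absent from the corpus is a choice made for this sketch.

```elixir
defmodule BatchIdfSketch do
  # Build a term => document-frequency map in one pass over the corpus.
  # Each document contributes at most 1 per term, hence Enum.uniq/1.
  def document_frequencies(corpus) do
    Enum.reduce(corpus, %{}, fn doc, acc ->
      doc
      |> Enum.uniq()
      |> Enum.reduce(acc, fn term, inner -> Map.update(inner, term, 1, &(&1 + 1)) end)
    end)
  end

  # Assumption: plain ln(N / df). Terms absent from the corpus get 0.0 here.
  def batch_idf(terms, corpus) do
    n = length(corpus)
    dfs = document_frequencies(corpus)

    Map.new(terms, fn term ->
      case Map.get(dfs, term, 0) do
        0 -> {term, 0.0}
        df -> {term, :math.log(n / df)}
      end
    end)
  end
end

corpus = [["search", "engine"], ["engine", "query"]]
idf_map = BatchIdfSketch.batch_idf(["search", "engine"], corpus)
IO.inspect(idf_map["engine"])  # 0.0, because "engine" is in every document
```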
@spec calculate(String.t(), String.t() | [String.t()], [String.t()] | [[String.t()]], options()) :: float()
Calculates TF-IDF score for a word in a document within a corpus.
Arguments
word - The word to calculate TF-IDF for
document - The document text (string) or pre-tokenized list
corpus - List of documents (strings or pre-tokenized lists)
opts - Options (see options/0)
Examples
iex> documents = ["dog hat", "dog", "cat mat", "duck"]
iex> ExNlp.Ranking.TfIdf.calculate("dog", "nice dog dog", documents)
0.3405504158439938
# With tokenized input
iex> tokenized_doc = ["nice", "dog", "dog"]
iex> tokenized_corpus = [["dog", "hat"], ["dog"], ["cat", "mat"], ["duck"]]
iex> ExNlp.Ranking.TfIdf.calculate("dog", tokenized_doc, tokenized_corpus)
0.3405504158439938
# With pre-processed tokens (faster, skips tokenization/processing)
iex> processed_doc = ["nice", "dog", "dog"]
iex> processed_corpus = [["dog", "hat"], ["dog"], ["cat", "mat"]]
iex> ExNlp.Ranking.TfIdf.calculate_from_tokens("dog", processed_doc, processed_corpus)
0.19178804830118723
@spec calculate_all(String.t() | [String.t()], [String.t()] | [[String.t()]], options()) :: [{String.t(), float()}]
Calculates TF-IDF scores for all words in a document.
Returns a list of {word, score} tuples sorted by score (descending).
Note: This function is slower than calculate_from_tokens/4 because it performs
tokenization and processing. For better performance, use pre-processed tokens
with calculate_from_tokens/4 or process tokens once and reuse them.
Examples
iex> documents = ["dog hat", "dog", "cat mat", "duck"]
iex> ExNlp.Ranking.TfIdf.calculate_all("nice dog", documents)
[{"nice", 0.8047189562170501}, {"dog", 0.25541281188299536}]
# Custom tokenizer
iex> tokenizer = fn text -> String.split(text, ",") |> Enum.map(&String.trim/1) end
iex> ExNlp.Ranking.TfIdf.calculate_all("nice,dog", ["dog,hat", "dog", "cat,mat"], tokenizer: tokenizer)
[{"nice", 0.6931471805599453}, {"dog", 0.14384103622589042}]
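The custom-tokenizer example can be reproduced with a short standalone sketch that scores every distinct token and sorts the result. CalculateAllSketch is a hypothetical module. It assumes tf = count / length, idf = ln(N / df), and that the scored document is counted as part of the corpus, which is what the documented values are consistent with.

```elixir
defmodule CalculateAllSketch do
  # Score every distinct token of `document`, highest score first.
  # Assumptions: tf = count / length, idf = ln(N / df), and the scored
  # document is appended to the corpus, so df is always at least 1.
  def calculate_all(document, documents, tokenizer) do
    doc = tokenizer.(document)
    corpus = Enum.map(documents, tokenizer) ++ [doc]
    n = length(corpus)

    doc
    |> Enum.frequencies()
    |> Enum.map(fn {term, count} ->
      df = Enum.count(corpus, &(term in &1))
      {term, count / length(doc) * :math.log(n / df)}
    end)
    |> Enum.sort_by(fn {_term, score} -> score end, :desc)
  end
end

tokenizer = fn text -> text |> String.split(",") |> Enum.map(&String.trim/1) end

# "nice" scores 0.5 * ln(4/1) and "dog" scores 0.5 * ln(4/3)
IO.inspect(CalculateAllSketch.calculate_all("nice,dog", ["dog,hat", "dog", "cat,mat"], tokenizer))
```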
calculate_from_tokens/4
Calculates TF-IDF score from pre-processed token lists.
This is more efficient when you already have tokenized and processed documents, as it skips the tokenization and processing steps.
Arguments
word - The word to calculate TF-IDF for (should already be processed)
document - Pre-processed token list
corpus - List of pre-processed token lists
opts - Options (see options/0)
Examples
iex> processed_doc = ["nice", "dog", "dog"]
...> processed_corpus = [["dog", "hat"], ["dog"], ["cat", "mat"]]
...> ExNlp.Ranking.TfIdf.calculate_from_tokens("dog", processed_doc, processed_corpus)
0.19178804830118723
# With BM25 variant
iex> processed_doc2 = ["nice", "dog", "dog"]
...> processed_corpus2 = [["dog", "hat"], ["dog"], ["cat", "mat"]]
...> ExNlp.Ranking.TfIdf.calculate_from_tokens("dog", processed_doc2, processed_corpus2,
...> tf_variant: :raw, idf_variant: :bm25
...> )
0.7133498878774648
@spec document_term_frequencies([String.t()], :normalized | :raw, boolean()) :: %{required(String.t()) => float() | non_neg_integer()}
Calculate term frequencies for all terms in a document at once.
Returns a map of term => tf. More efficient than calling
term_frequency/4 repeatedly for each term in a document.
Arguments
doc_tokens - List of tokens in the document
variant - :normalized (default) for normalized TF, :raw for raw count
sublinear - If true, apply log scaling (only for normalized variant)
Examples
iex> tokens = ["search", "engine", "search"]
iex> freqs = ExNlp.Ranking.TfIdf.document_term_frequencies(tokens)
iex> freqs["search"]
0.6666666666666666
iex> freqs["engine"]
0.3333333333333333
# Raw frequencies
iex> tokens = ["search", "engine", "search"]
iex> freqs = ExNlp.Ranking.TfIdf.document_term_frequencies(tokens, :raw)
iex> freqs["search"]
2
iex> freqs["engine"]
1
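The efficiency gain comes from counting every token once and deriving all frequencies from those counts, instead of rescanning the document for each term. A minimal standalone sketch of that idea follows. TermFreqSketch is a hypothetical module and leaves out the sublinear option.

```elixir
defmodule TermFreqSketch do
  # Count every token once, then derive all frequencies from the counts.
  def frequencies(tokens, variant \\ :normalized)

  # Raw variant: the occurrence counts themselves.
  def frequencies(tokens, :raw), do: Enum.frequencies(tokens)

  # Normalized variant: each count divided by the document length.
  def frequencies(tokens, :normalized) do
    total = length(tokens)
    tokens |> Enum.frequencies() |> Map.new(fn {term, count} -> {term, count / total} end)
  end
end

tokens = ["search", "engine", "search"]
IO.inspect(TermFreqSketch.frequencies(tokens))        # "search" => 2/3, "engine" => 1/3
IO.inspect(TermFreqSketch.frequencies(tokens, :raw))  # %{"engine" => 1, "search" => 2}
```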
@spec inverse_document_frequency(String.t(), [[String.t()]], :standard | :bm25, boolean()) :: float()
Returns the inverse document frequency (IDF) of a word in a corpus.
IDF measures how rare or common a word is across the entire corpus. Common words have lower IDF, rare words have higher IDF.
Arguments
word - The word to calculate IDF for
corpus - List of documents (each document is a list of tokens)
variant - :standard (default) for standard TF-IDF IDF, :bm25 for BM25 variant
smooth - If true, use smoothed IDF (only applies to :standard variant)
Examples
# Standard IDF
iex> corpus = [["dog", "hat"], ["dog"], ["cat", "mat"], ["duck"]]
iex> ExNlp.Ranking.TfIdf.inverse_document_frequency("dog", corpus)
0.6931471805599453
# BM25 variant IDF
iex> corpus = [["dog", "hat"], ["dog"], ["cat", "mat"]]
...> ExNlp.Ranking.TfIdf.inverse_document_frequency("dog", corpus, :bm25)
0.47000362924573563
# Smoothed standard IDF
iex> corpus = [["dog", "hat"], ["dog"], ["cat", "mat"]]
...> ExNlp.Ranking.TfIdf.inverse_document_frequency("dog", corpus, :standard, true)
1.2876820724517808
iex> corpus = [["dog", "hat"], ["dog"], ["cat", "mat"]]
...> ExNlp.Ranking.TfIdf.inverse_document_frequency("unique", corpus)
1.791759469228055
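The first three values above match three well-known IDF formulas: ln(N / df) for the standard variant, ln(1 + (N - df + 0.5) / (df + 0.5)) for the BM25 variant, and 1 + ln((1 + N) / (1 + df)) for the smoothed variant. The standalone sketch below evaluates them. IdfSketch is a hypothetical module, and the formulas are inferred from the documented numbers rather than taken from the library source. It does not attempt to reproduce the value shown for a word that is absent from the corpus.

```elixir
defmodule IdfSketch do
  # Number of documents that contain `word`.
  defp df(word, corpus), do: Enum.count(corpus, &(word in &1))

  # Standard: ln(N / df). Raises ArithmeticError when df is 0.
  def standard(word, corpus), do: :math.log(length(corpus) / df(word, corpus))

  # Smoothed: 1 + ln((1 + N) / (1 + df)), defined even when df is 0.
  def smoothed(word, corpus),
    do: 1 + :math.log((1 + length(corpus)) / (1 + df(word, corpus)))

  # BM25-style: ln(1 + (N - df + 0.5) / (df + 0.5)).
  def bm25(word, corpus) do
    d = df(word, corpus)
    :math.log(1 + (length(corpus) - d + 0.5) / (d + 0.5))
  end
end

four_docs = [["dog", "hat"], ["dog"], ["cat", "mat"], ["duck"]]
three_docs = [["dog", "hat"], ["dog"], ["cat", "mat"]]

IO.inspect(IdfSketch.standard("dog", four_docs))   # ln(4/2), about 0.693
IO.inspect(IdfSketch.bm25("dog", three_docs))      # ln(1.6), about 0.470
IO.inspect(IdfSketch.smoothed("dog", three_docs))  # 1 + ln(4/3), about 1.288
```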
@spec raw_term_frequency(String.t(), [String.t()]) :: non_neg_integer()
Returns the raw term frequency (count) of a word in a document.
This is a convenience function. For consistency with the rest of the API, prefer
term_frequency/4 with the variant argument set to :raw.
Examples
iex> ExNlp.Ranking.TfIdf.raw_term_frequency("dog", ["nice", "dog", "dog"])
2
iex> ExNlp.Ranking.TfIdf.raw_term_frequency("cat", ["nice", "dog", "dog"])
0
score/3
Scores documents against a query using TF-IDF.
Returns a list of scores, one for each document. Higher scores indicate greater relevance. The score is the sum of TF-IDF scores for all query terms.
Arguments
documents - List of document texts (strings) or pre-tokenized lists
query - Query as a list of keywords or a single string
opts - Options (see options/0)
Examples
iex> documents = [
...> "The quick brown fox jumps over the lazy dog",
...> "A brown dog is running in the park",
...> "The fox and the dog are friends"
...> ]
iex> query = ["brown", "fox"]
iex> ExNlp.Ranking.TfIdf.score(documents, query)
[0.06392934943372908, 0.035960259056472606, 0.04109743892168297]
# Query as a single string
iex> documents = ["The quick brown fox", "A brown dog", "The fox and dog"]
...> ExNlp.Ranking.TfIdf.score(documents, "brown fox")
[0.14384103622589042, 0.09589402415059362, 0.07192051811294521]
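The scoring rule described above, summing the TF-IDF of each query term per document, can be sketched without the library. ScoreSketch is a hypothetical module. It assumes lowercase whitespace tokenization, tf = count / length, idf = ln(N / df), and that each scored document is counted as one extra corpus document, which is what the documented scores are consistent with.

```elixir
defmodule ScoreSketch do
  defp tokenize(text), do: text |> String.downcase() |> String.split()

  # TF-IDF of one term. Assumption: `doc` is appended to the corpus as an
  # extra document before the IDF is computed.
  defp tf_idf(term, doc, corpus) do
    count = Enum.count(doc, &(&1 == term))

    if count == 0 do
      0.0
    else
      all = corpus ++ [doc]
      df = Enum.count(all, &(term in &1))
      count / length(doc) * :math.log(length(all) / df)
    end
  end

  # One score per document: the sum of TF-IDF over the query terms.
  # The query may be a single string or a list of keywords.
  def score(documents, query) when is_binary(query), do: score(documents, tokenize(query))

  def score(documents, query) when is_list(query) do
    corpus = Enum.map(documents, &tokenize/1)

    Enum.map(corpus, fn doc ->
      query |> Enum.map(&tf_idf(&1, doc, corpus)) |> Enum.sum()
    end)
  end
end

docs = ["The quick brown fox", "A brown dog", "The fox and dog"]

# Roughly [0.144, 0.096, 0.072]: the first document matches both query terms
IO.inspect(ScoreSketch.score(docs, "brown fox"))
```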
@spec score_document(String.t() | [String.t()], [String.t()] | String.t(), [String.t()] | [[String.t()]], options()) :: float()
Scores a single document against a query using TF-IDF.
Examples
iex> documents = ["The quick brown fox", "A brown dog"]
iex> query = ["brown", "fox"]
iex> ExNlp.Ranking.TfIdf.score_document("The quick brown fox", query, documents)
0.07192051811294521
# With options
iex> documents = ["The quick brown fox", "A brown dog"]
...> query = ["brown", "fox"]
...> ExNlp.Ranking.TfIdf.score_document("The quick brown fox", query, documents,
...> tf_variant: :raw, idf_variant: :bm25
...> )
0.46203545959655873
@spec score_document_from_tokens([String.t()], [String.t()], [[String.t()]], options()) :: float()
Scores a single document from pre-processed tokens against a query.
Arguments
document - Pre-processed token list
query - Pre-processed token list
corpus - List of pre-processed token lists (for IDF calculation)
opts - Options (TF-IDF specific options are used)
Examples
iex> processed_doc = ["brown", "fox"]
iex> processed_query = ["brown", "fox"]
iex> processed_corpus = [["brown", "dog"], ["fox", "dog"]]
iex> ExNlp.Ranking.TfIdf.score_document_from_tokens(processed_doc, processed_query, processed_corpus)
0.28768207245178085
score_from_tokens/3
Scores documents from pre-processed token lists using TF-IDF.
This is more efficient when you already have tokenized and processed documents, as it skips the tokenization and processing steps.
Arguments
documents - List of pre-processed token lists
query - Pre-processed token list
opts - Options (TF-IDF specific options are used, processing options ignored)
Examples
iex> processed_docs = [["brown", "fox"], ["brown", "dog"], ["fox", "dog"]]
iex> processed_query = ["brown", "fox"]
iex> ExNlp.Ranking.TfIdf.score_from_tokens(processed_docs, processed_query)
[0.28768207245178085, 0.14384103622589042, 0.14384103622589042]
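Because the scores come back in the same order as the documents, ranking is a matter of pairing the two lists and sorting. The sketch below uses only the standard library and the score list from the example above. RankSketch is a hypothetical helper, not part of ExNlp.

```elixir
defmodule RankSketch do
  # Pair each document with its score and sort highest first.
  # Enum.sort_by/3 is stable, so tied documents keep their input order.
  def rank(documents, scores) do
    documents
    |> Enum.zip(scores)
    |> Enum.sort_by(fn {_doc, score} -> score end, :desc)
  end
end

docs = [["brown", "fox"], ["brown", "dog"], ["fox", "dog"]]
scores = [0.28768207245178085, 0.14384103622589042, 0.14384103622589042]

[{best, _score} | _rest] = RankSketch.rank(docs, scores)
IO.inspect(best)  # ["brown", "fox"]
```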
@spec term_frequency(String.t(), [String.t()], :normalized | :raw, boolean()) :: float() | non_neg_integer()
Returns the term frequency (TF) of a word in a document.
Arguments
word - The word to calculate TF for
document - List of tokens representing the document
variant - :normalized (default) for normalized TF, :raw for raw count
sublinear - If true, apply log scaling: 1 + log(tf) (only for normalized)
Examples
# Normalized TF (default)
iex> ExNlp.Ranking.TfIdf.term_frequency("dog", ["nice", "dog", "dog"])
0.6666666666666666
# Raw TF count
iex> ExNlp.Ranking.TfIdf.term_frequency("dog", ["nice", "dog", "dog"], :raw)
2
# Sublinear normalized TF
iex> ExNlp.Ranking.TfIdf.term_frequency("dog", ["nice", "dog", "dog"], :normalized, true)
0.5945348918918356
iex> ExNlp.Ranking.TfIdf.term_frequency("cat", ["nice", "dog", "dog"])
0.0
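The sublinear value above equals 1 + ln(2/3), which suggests the logarithm is taken of the normalized frequency rather than of the raw count. The standalone sketch below reproduces the normalized and sublinear examples on that assumption. SublinearTfSketch is a hypothetical module, and returning 0.0 for an absent word mirrors the last example.

```elixir
defmodule SublinearTfSketch do
  # Normalized TF, optionally damped with 1 + ln(tf).
  # A word that does not occur keeps a TF of 0.0, since ln(0) is undefined.
  def tf(word, tokens, sublinear \\ false) do
    count = Enum.count(tokens, &(&1 == word))

    cond do
      count == 0 -> 0.0
      sublinear -> 1 + :math.log(count / length(tokens))
      true -> count / length(tokens)
    end
  end
end

doc = ["nice", "dog", "dog"]
IO.inspect(SublinearTfSketch.tf("dog", doc))        # 2/3
IO.inspect(SublinearTfSketch.tf("dog", doc, true))  # 1 + ln(2/3), about 0.5945
IO.inspect(SublinearTfSketch.tf("cat", doc))        # 0.0
```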