Information-retrieval scoring against an indexed corpus.
Two scoring functions are provided:
- TF-IDF (tfidf/3): the classical term frequency × inverse document frequency. Useful as a feature for clustering and classification, and as a fast first-pass relevance signal.
- BM25 (bm25/4): Okapi BM25, the de facto standard probabilistic relevance ranking used by Lucene/Elasticsearch and most modern search engines. Generally preferable to TF-IDF for ranking results against a query.
Both score functions consume a Text.IR.Corpus built once with
Text.IR.Corpus.new/2. The corpus precomputes document frequencies,
term frequencies, and document lengths so per-query scoring is
O(query terms × matching documents).
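As a rough illustration of the kind of statistics described above, the sketch below precomputes term frequencies, document frequencies, and document lengths in one pass. It is not the Text.IR.Corpus implementation: the module name CorpusSketch, the map layout, and the lowercase whitespace tokenisation are all assumptions made for the example.

```elixir
defmodule CorpusSketch do
  # Hypothetical stand-in for the statistics a corpus might precompute.
  # Tokenisation is a lowercase whitespace split, which is an assumption.
  def build(docs) do
    tokenised =
      Enum.map(docs, fn doc -> doc |> String.downcase() |> String.split() end)

    # One map of term => count per document, in document order.
    term_freqs = Enum.map(tokenised, &Enum.frequencies/1)

    # Number of documents containing each term.
    doc_freqs =
      term_freqs
      |> Enum.flat_map(&Map.keys/1)
      |> Enum.frequencies()

    doc_lengths = Enum.map(tokenised, &length/1)

    %{
      n: length(docs),
      term_freqs: term_freqs,
      doc_freqs: doc_freqs,
      doc_lengths: doc_lengths,
      # max/2 guards against division by zero for an empty corpus.
      avgdl: Enum.sum(doc_lengths) / max(length(docs), 1)
    }
  end
end
```

With these maps in hand, scoring one query term against one document needs only a few lookups, which is what keeps the per-query cost proportional to query terms × matching documents.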
TF-IDF formula
tf-idf(t, d) = tf(t, d) · log(N / df(t))
where tf(t, d) is the count of term t in document d, N is
the total number of documents in the corpus, and df(t) is the
number of documents containing t. This module uses raw term
frequency (not log-normalised) and the smooth IDF variant
log((N + 1) / (df + 1)) + 1 so the score is non-negative even
when a query term occurs in every document.
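The smooth variant can be checked with a few lines of arithmetic. The following is a minimal sketch, not the module's code: TfIdfSketch and its argument order are hypothetical, and the natural logarithm is assumed because the text does not state the base.

```elixir
defmodule TfIdfSketch do
  # Smooth IDF as described above: log((N + 1) / (df + 1)) + 1.
  # :math.log/1 is the natural logarithm; the base is an assumption.
  def idf(df, n), do: :math.log((n + 1) / (df + 1)) + 1.0

  # Raw term frequency times smooth IDF.
  # A term absent from the document (tf == 0) scores 0.0.
  def tfidf(0, _df, _n), do: 0.0
  def tfidf(tf, df, n), do: tf * idf(df, n)
end

# A term that occurs twice in a document and appears in all 3 documents:
# idf = log(4 / 4) + 1 = 1.0, so the score stays positive.
TfIdfSketch.tfidf(2, 3, 3)
# => 2.0
```

Under the unsmoothed formula the same term would score 2 · log(3 / 3) = 0, which is the case the smooth variant is meant to avoid.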
BM25 formula
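score(d, q) = Σ over t in q: idf(t) · (tf · (k1 + 1)) /
(tf + k1 · (1 - b + b · |d|/avgdl))
with the smooth IDF log((N - df + 0.5) / (df + 0.5) + 1) (Lucene's
variant, which guarantees non-negative IDF) and parameters k1 = 1.2,
b = 0.75 by default. Here tf is the count of t in d, |d| is the
length of document d, and avgdl is the average document length in
the corpus.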
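The formula translates fairly directly into code. The sketch below is an illustration under stated assumptions rather than the library's implementation: Bm25Sketch, its functions, and the {tf, df} input shape are hypothetical, and the natural logarithm is assumed.

```elixir
defmodule Bm25Sketch do
  # Lucene-style smooth IDF: log(1 + (N - df + 0.5) / (df + 0.5)).
  def idf(df, n), do: :math.log(1 + (n - df + 0.5) / (df + 0.5))

  # Contribution of one query term to the score of one document.
  def term_score(tf, df, n, doc_len, avgdl, opts \\ []) do
    k1 = Keyword.get(opts, :k1, 1.2)
    b = Keyword.get(opts, :b, 0.75)

    if tf == 0 do
      # A term that does not occur in the document contributes nothing.
      0.0
    else
      # Length normalisation: 1.0 for a document of average length.
      norm = 1 - b + b * doc_len / avgdl
      idf(df, n) * (tf * (k1 + 1)) / (tf + k1 * norm)
    end
  end

  # terms is a list of {tf, df} pairs, one per query term.
  def score(terms, n, doc_len, avgdl, opts \\ []) do
    Enum.reduce(terms, 0.0, fn {tf, df}, acc ->
      acc + term_score(tf, df, n, doc_len, avgdl, opts)
    end)
  end
end

# Two query terms, each occurring once in a 6-token document and in
# 2 of 3 documents overall, with an average document length of 5.0.
Bm25Sketch.score([{1, 2}, {1, 2}], 3, 6, 5.0)
```

Passing b: 0.0 makes norm equal to 1 for every document, so document length no longer affects the score.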
Summary
Functions
- bm25/4: Returns the BM25 score for the entire query against document doc_id.
- search/3: Returns the top-K documents matching query, ranked by score.
- tfidf/3: Returns the TF-IDF score for term in document doc_id of the given corpus.
Functions
@spec bm25(
        Text.IR.Corpus.t(),
        Text.IR.Corpus.doc_id(),
        String.t() | [Text.IR.Corpus.term_string()],
        keyword()
      ) :: float()
Returns the BM25 score for the entire query against document doc_id.
Arguments
- corpus is a Text.IR.Corpus.
- doc_id is the zero-based document index.
- query is either a string (which is tokenised through the corpus's tokenizer) or a list of pre-tokenised terms.
Options
- :k1 is the saturation parameter. Defaults to 1.2. Higher values make repeated occurrences of the same term in a document weigh more heavily.
- :b is the length-normalisation parameter. Defaults to 0.75. 0.0 disables length normalisation; 1.0 normalises fully.
Returns
- A non-negative float. 0.0 if no query term appears in the document.
Examples
iex> docs = ["the cat sat on the mat", "the dog sat on the log", "the cat ran"]
iex> corpus = Text.IR.Corpus.new(docs)
iex> Text.IR.bm25(corpus, 0, "cat sat") > Text.IR.bm25(corpus, 1, "cat sat")
true
@spec search(
        Text.IR.Corpus.t(),
        String.t() | [Text.IR.Corpus.term_string()],
        keyword()
      ) :: [{Text.IR.Corpus.doc_id(), float()}]
Returns the top-K documents matching query, ranked by score.
Arguments
- corpus is a Text.IR.Corpus.
- query is a string or pre-tokenised list of terms.
Options
- :scorer is :bm25 (default) or :tfidf.
- :k is the number of results to return. Defaults to 10.
- :k1, :b are passed through to BM25 when scorer: :bm25.
Returns
- A list of {doc_id, score} pairs in descending score order. Documents with score 0.0 are excluded.
Examples
iex> docs = [
...> "the cat sat on the mat",
...> "the dog sat on the log",
...> "elephants are large"
...> ]
iex> corpus = Text.IR.Corpus.new(docs)
iex> [{best_id, _score} | _] = Text.IR.search(corpus, "cat", k: 3)
iex> best_id
0
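The ranking contract described under Returns (descending score order, zero scores dropped, at most :k results) can be sketched independently of the scorer. The module TopKSketch and its input shape, a list of scores indexed by document id, are assumptions made for illustration and say nothing about how search itself is implemented.

```elixir
defmodule TopKSketch do
  # scores is a list of floats indexed by zero-based document id.
  def top_k(scores, k \\ 10) do
    scores
    |> Enum.with_index()
    |> Enum.map(fn {score, doc_id} -> {doc_id, score} end)
    # Documents that match no query term score 0.0 and are excluded.
    |> Enum.reject(fn {_doc_id, score} -> score == 0.0 end)
    |> Enum.sort_by(fn {_doc_id, score} -> score end, :desc)
    |> Enum.take(k)
  end
end

TopKSketch.top_k([0.5, 0.0, 1.2, 0.7], 2)
# => [{2, 1.2}, {3, 0.7}]
```

Sorting every candidate and then taking k is the simplest approach; for a large corpus and small k, a bounded heap would avoid sorting documents that can never make the cut.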
@spec tfidf(Text.IR.Corpus.t(), Text.IR.Corpus.doc_id(), Text.IR.Corpus.term_string()) :: float()
Returns the TF-IDF score for term in document doc_id of the
given corpus.
Arguments
- corpus is a Text.IR.Corpus.
- doc_id is the zero-based document index.
- term is the term string to score. The term is folded to match the corpus's case-folding setting before lookup.
Returns
- A non-negative float. Returns 0.0 when the term doesn't appear in the document.
Examples
iex> docs = ["the cat sat", "the dog sat", "a fox ran"]
iex> corpus = Text.IR.Corpus.new(docs)
iex> Text.IR.tfidf(corpus, 0, "cat") > 0.0
true
iex> docs = ["the cat sat", "the dog sat", "a fox ran"]
iex> corpus = Text.IR.Corpus.new(docs)
iex> Text.IR.tfidf(corpus, 0, "missing")
0.0