Text.IR (Text v0.5.0)

Information-retrieval scoring against an indexed corpus.

Two scoring functions are provided:

  • TF-IDF (tfidf/3) — the classical term frequency × inverse document frequency. Useful as a feature for clustering and classification, and as a fast first-pass relevance signal.

  • BM25 (bm25/4) — Okapi BM25, the de facto standard probabilistic relevance ranking used by Lucene/Elasticsearch and most modern search engines. Generally the better choice over TF-IDF for ranking results against a query, because it saturates repeated terms and normalises for document length.

Both score functions consume a Text.IR.Corpus built once with Text.IR.Corpus.new/2. The corpus precomputes document frequencies, term frequencies, and document lengths so per-query scoring is O(query terms × matching documents).
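The kind of precomputation described above can be illustrated with a minimal sketch. This is not the library's actual data layout: the module name CorpusSketch, the map keys, and the lowercase whitespace tokenisation are all assumptions made for illustration.

```elixir
defmodule CorpusSketch do
  # Hypothetical stand-in for the statistics a corpus might precompute.
  # Tokenisation is a plain lowercase whitespace split (an assumption).
  def build(docs) do
    tokenised = Enum.map(docs, &(&1 |> String.downcase() |> String.split()))

    # Per-document term frequencies: one %{term => count} map per document.
    tfs = Enum.map(tokenised, &Enum.frequencies/1)

    # Document frequency: the number of documents that contain each term.
    df =
      tfs
      |> Enum.flat_map(&Map.keys/1)
      |> Enum.frequencies()

    # Document lengths in tokens, plus their mean (used by BM25).
    lengths = Enum.map(tokenised, &length/1)

    %{
      n: length(docs),
      tfs: tfs,
      df: df,
      lengths: lengths,
      avgdl: Enum.sum(lengths) / max(length(docs), 1)
    }
  end
end

stats = CorpusSketch.build(["the cat sat on the mat", "the dog sat on the log", "the cat ran"])
stats.df["cat"]
# => 2
```

With these tables in hand, scoring a query only needs map lookups for the query terms, which is where the per-query cost quoted above comes from.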

TF-IDF formula

tf-idf(t, d) = tf(t, d) · log(N / df(t))

where tf(t, d) is the count of term t in document d, N is the total number of documents in the corpus, and df(t) is the number of documents containing t. The formula above is the classical form. This module uses raw term frequency (not log-normalised) and replaces the IDF factor with the smooth variant log((N + 1) / (df + 1)) + 1, so the score stays strictly positive even when a term occurs in every document (the classical IDF would be zero in that case).
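As a rough sketch of the arithmetic, assuming the smooth IDF variant stated above, the score can be computed from three numbers. The module name TfIdfSketch and its argument order are hypothetical and are not part of Text.IR.

```elixir
defmodule TfIdfSketch do
  # Smooth IDF: log((N + 1) / (df + 1)) + 1, natural logarithm.
  def idf(n, df), do: :math.log((n + 1) / (df + 1)) + 1

  # Raw term frequency times smooth IDF; 0.0 when the term is absent.
  def score(_n, _df, 0), do: 0.0
  def score(n, df, tf), do: tf * idf(n, df)
end

# A term that occurs twice in a document and appears in all 3 documents
# of a 3-document corpus: idf = log(4 / 4) + 1 = 1.0.
TfIdfSketch.score(3, 3, 2)
# => 2.0
```

A rarer term (smaller df) gets a larger IDF, so a single occurrence of it outweighs a single occurrence of a term found everywhere.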

BM25 formula

score(d, q) = Σ over t in q: idf(t) · (tf · (k1 + 1)) /
                              (tf + k1 · (1 - b + b · |d|/avgdl))

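where tf is the count of term t in document d, |d| is the length of d, and avgdl is the average document length in the corpus. idf(t) is the smooth IDF log((N - df + 0.5) / (df + 0.5) + 1) (Lucene's variant, which guarantees a non-negative IDF). The parameters default to k1 = 1.2 and b = 0.75.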
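The formula can be transcribed almost term by term. The following is a minimal sketch under the definitions given here, not the library's implementation: the module name Bm25Sketch and the {df, tf} pair representation of query-term statistics are assumptions for illustration.

```elixir
defmodule Bm25Sketch do
  # Lucene-style IDF: log((N - df + 0.5) / (df + 0.5) + 1), never negative.
  def idf(n, df), do: :math.log((n - df + 0.5) / (df + 0.5) + 1)

  # Contribution of one query term to the score of one document.
  # tf: term count in the document, dl: document length, avgdl: mean length.
  def term_score(n, df, tf, dl, avgdl, k1 \\ 1.2, b \\ 0.75) do
    norm = 1 - b + b * dl / avgdl
    idf(n, df) * (tf * (k1 + 1)) / (tf + k1 * norm)
  end

  # Sum over the query terms; each entry is a hypothetical {df, tf} pair.
  def score(n, dl, avgdl, term_stats) do
    term_stats
    |> Enum.map(fn {df, tf} -> term_score(n, df, tf, dl, avgdl) end)
    |> Enum.sum()
  end
end

# Two query terms, each found in 2 of 3 documents and once in a
# 6-token document, with an average document length of 5.0.
Bm25Sketch.score(3, 6, 5.0, [{2, 1}, {2, 1}])
```

A term with tf = 0 contributes 0.0, so only query terms that actually occur in the document affect the sum.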

Summary

Functions

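bm25(corpus, doc_id, query, options \\ [])

Returns the BM25 score for the entire query against document doc_id.

search(corpus, query, options \\ [])

Returns the top-K documents matching query, ranked by score.

tfidf(corpus, doc_id, term)

Returns the TF-IDF score for term in document doc_id of the given corpus.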

Functions

bm25(corpus, doc_id, query, options \\ [])

Returns the BM25 score for the entire query against document doc_id.

Arguments

  • corpus is a Text.IR.Corpus.

  • doc_id is the zero-based document index.

  • query is either a string (which is tokenised through the corpus's tokenizer) or a list of pre-tokenised terms.

Options

  • :k1 — saturation parameter, defaults to 1.2. Higher values make repeated occurrences of the same term in a document weigh more heavily.

  • :b — length-normalisation parameter, defaults to 0.75. 0.0 disables length normalisation; 1.0 normalises fully.
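To see what the two options do, it can help to look at the term-frequency factor of BM25 on its own, without the IDF weight. The sketch below assumes the standard BM25 factor; the module name SaturationSketch is hypothetical and not part of Text.IR.

```elixir
defmodule SaturationSketch do
  # Term-frequency factor of BM25 without the IDF weight.
  def factor(tf, k1, b, dl, avgdl) do
    tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
  end
end

# With b = 0.0 the document length drops out of the factor entirely.
short = SaturationSketch.factor(2, 1.2, 0.0, 3, 5.0)
long = SaturationSketch.factor(2, 1.2, 0.0, 50, 5.0)
short == long
# => true

# The factor grows with tf but never reaches k1 + 1.
Enum.map([1, 2, 10, 100], &SaturationSketch.factor(&1, 1.2, 0.0, 5, 5.0))
```

Raising k1 lifts that ceiling, which is why repeated occurrences count for more. Raising b toward 1.0 penalises documents longer than average more strongly.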

Returns

  • A non-negative float. Returns 0.0 if no query term appears in the document.

Examples

iex> docs = ["the cat sat on the mat", "the dog sat on the log", "the cat ran"]
iex> corpus = Text.IR.Corpus.new(docs)
iex> Text.IR.bm25(corpus, 0, "cat sat") > Text.IR.bm25(corpus, 1, "cat sat")
true

search(corpus, query, options \\ [])

Returns the top-K documents matching query, ranked by score.

Arguments

  • corpus is a Text.IR.Corpus.

  • query is a string or pre-tokenised list of terms.

Options

  • :scorer is either :bm25 (default) or :tfidf.

  • :k — number of results to return. Defaults to 10.

  • :k1, :b — passed through to BM25 when scorer: :bm25.

Returns

  • A list of {doc_id, score} pairs in descending score order. Documents with score 0.0 are excluded.
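The shape of the result can be sketched as a filter, sort, and truncate step over per-document scores. This is an illustration of the documented return contract only; the module name TopKSketch is hypothetical and the library may select the top K differently.

```elixir
defmodule TopKSketch do
  # scores: a list of {doc_id, score} pairs produced by any scorer.
  def top_k(scores, k \\ 10) do
    scores
    # Documents that matched nothing are dropped.
    |> Enum.reject(fn {_id, score} -> score == 0.0 end)
    # Highest score first.
    |> Enum.sort_by(fn {_id, score} -> score end, :desc)
    |> Enum.take(k)
  end
end

TopKSketch.top_k([{0, 0.47}, {1, 0.0}, {2, 1.3}, {3, 0.9}], 2)
# => [{2, 1.3}, {3, 0.9}]
```

Because zero-score documents are excluded before truncation, the result may contain fewer than K entries.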

Examples

iex> docs = [
...>   "the cat sat on the mat",
...>   "the dog sat on the log",
...>   "elephants are large"
...> ]
iex> corpus = Text.IR.Corpus.new(docs)
iex> [{best_id, _score} | _] = Text.IR.search(corpus, "cat", k: 3)
iex> best_id
0

tfidf(corpus, doc_id, term)

Returns the TF-IDF score for term in document doc_id of the given corpus.

Arguments

  • corpus is a Text.IR.Corpus.

  • doc_id is the zero-based document index.

  • term is the term string to score. The term is folded to match the corpus's case-folding setting before lookup.

Returns

  • A non-negative float. Returns 0.0 when the term doesn't appear in the document.

Examples

iex> docs = ["the cat sat", "the dog sat", "a fox ran"]
iex> corpus = Text.IR.Corpus.new(docs)
iex> Text.IR.tfidf(corpus, 0, "cat") > 0.0
true

iex> docs = ["the cat sat", "the dog sat", "a fox ran"]
iex> corpus = Text.IR.Corpus.new(docs)
iex> Text.IR.tfidf(corpus, 0, "missing")
0.0