# `Text.IR`
[🔗](https://github.com/kipcole9/text/blob/v0.5.0/lib/ir.ex#L1)

Information-retrieval scoring against an indexed corpus.

Two scoring functions are provided:

* **TF-IDF** (`tfidf/3`) — the classical term frequency × inverse
  document frequency. Useful as a feature for clustering and
  classification, and as a fast first-pass relevance signal.

* **BM25** (`bm25/4`) — Okapi BM25, the de-facto standard probabilistic
  relevance ranking used by Lucene/Elasticsearch and most modern
  search engines. It generally ranks results for a query better than
  TF-IDF because it saturates term frequency and normalises for
  document length.

Both scoring functions consume a `Text.IR.Corpus` built once with
`Text.IR.Corpus.new/2`. The corpus precomputes document frequencies,
term frequencies, and document lengths so per-query scoring is
O(query terms × matching documents).
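
As a rough illustration of what this kind of precomputation involves, the sketch below derives the three statistics from whitespace-tokenised strings. `CorpusSketch`, its map layout, and the lowercase-and-split tokenisation are all hypothetical; nothing here describes how `Text.IR.Corpus` is actually implemented.

```elixir
defmodule CorpusSketch do
  # Hypothetical helper, not part of Text.IR: builds per-document term
  # frequencies, document lengths, and corpus-wide document frequencies.
  def build(docs) do
    tokenised = Enum.map(docs, &String.split(String.downcase(&1)))

    # One %{term => count} map per document.
    term_freqs = Enum.map(tokenised, &Enum.frequencies/1)

    # Number of tokens in each document.
    doc_lengths = Enum.map(tokenised, &length/1)

    # For each term, the number of documents that contain it at least once.
    doc_freqs =
      term_freqs
      |> Enum.flat_map(&Map.keys/1)
      |> Enum.frequencies()

    %{
      n: length(docs),
      term_freqs: term_freqs,
      doc_lengths: doc_lengths,
      doc_freqs: doc_freqs,
      avgdl: Enum.sum(doc_lengths) / max(length(docs), 1)
    }
  end
end

stats =
  CorpusSketch.build(["the cat sat on the mat", "the dog sat on the log", "the cat ran"])

IO.inspect(stats.doc_freqs["cat"])  # 2
IO.inspect(stats.doc_lengths)       # [6, 6, 3]
```

With these maps in hand, scoring a query only needs lookups for the query's terms, which is where the per-query cost quoted above comes from.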

### TF-IDF formula

    tf-idf(t, d) = tf(t, d) · log(N / df(t))

where `tf(t, d)` is the count of term `t` in document `d`, `N` is
the total number of documents in the corpus, and `df(t)` is the
number of documents containing `t`. That is the textbook form. This
module uses raw term frequency (not log-normalised) and replaces the
IDF factor with the smooth variant `log((N + 1) / (df + 1)) + 1`, so
a term that occurs in every document still receives a positive
weight instead of zero.
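
A small worked example may make the smooth variant concrete. The module name `TfIdfSketch` is hypothetical, and the use of the natural logarithm is an assumption, since the base is not stated here.

```elixir
defmodule TfIdfSketch do
  # Smooth IDF: log((N + 1) / (df + 1)) + 1.
  # The natural logarithm is an assumption of this sketch.
  def idf(n, df), do: :math.log((n + 1) / (df + 1)) + 1

  # Raw term frequency multiplied by the smooth IDF.
  def tfidf(tf, n, df), do: tf * idf(n, df)
end

# 3 documents; the term occurs once in this document and in 1 document overall.
IO.inspect(TfIdfSketch.tfidf(1, 3, 1))  # log(4 / 2) + 1, roughly 1.693

# A term found in all 3 documents still contributes 1.0 per occurrence.
IO.inspect(TfIdfSketch.tfidf(2, 3, 3))  # 2.0

# A term absent from the document scores 0.0 whatever its IDF.
IO.inspect(TfIdfSketch.tfidf(0, 3, 1))  # 0.0
```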

### BM25 formula

    score(d, q) = Σ over t in q: idf(t) · (tf · (k1 + 1)) /
                                  (tf + k1 · (1 - b + b · |d|/avgdl))

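where `tf` is the count of `t` in `d`, `|d|` is the length of
document `d`, and `avgdl` is the average document length in the
corpus. The IDF is the smooth variant
`log((N - df + 0.5) / (df + 0.5) + 1)` (Lucene's variant, which
guarantees a non-negative IDF). The parameters default to `k1 = 1.2`
and `b = 0.75`.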
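
The formula can be sketched directly in Elixir. `Bm25Sketch` is a hypothetical stand-alone module that takes the corpus statistics as plain numbers; it assumes the natural logarithm and is not the code behind `bm25/4`.

```elixir
defmodule Bm25Sketch do
  # Lucene-style smooth IDF: log((N - df + 0.5) / (df + 0.5) + 1).
  def idf(n, df), do: :math.log((n - df + 0.5) / (df + 0.5) + 1)

  # Contribution of one query term to one document's score.
  # tf: occurrences of the term in the document, df: documents containing
  # the term, n: corpus size, dl: document length, avgdl: average length.
  def term_score(tf, df, n, dl, avgdl, opts \\ []) do
    k1 = Keyword.get(opts, :k1, 1.2)
    b = Keyword.get(opts, :b, 0.75)
    norm = 1 - b + b * dl / avgdl
    idf(n, df) * (tf * (k1 + 1)) / (tf + k1 * norm)
  end

  # Sum of the per-term contributions. `terms` is a list of {tf, df}
  # pairs, one per query term.
  def score(terms, n, dl, avgdl, opts \\ []) do
    terms
    |> Enum.map(fn {tf, df} -> term_score(tf, df, n, dl, avgdl, opts) end)
    |> Enum.sum()
  end
end

# Query "cat sat" over three documents of lengths 6, 6, and 3 (avgdl = 5.0),
# where "cat" and "sat" each occur in 2 documents.
doc0 = Bm25Sketch.score([{1, 2}, {1, 2}], 3, 6, 5.0)  # contains both terms
doc1 = Bm25Sketch.score([{0, 2}, {1, 2}], 3, 6, 5.0)  # contains only "sat"
IO.inspect(doc0 > doc1)  # true
```

A term with `tf = 0` contributes nothing, so only documents that contain at least one query term get a non-zero score.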

# `bm25`

```elixir
@spec bm25(
  Text.IR.Corpus.t(),
  Text.IR.Corpus.doc_id(),
  String.t() | [Text.IR.Corpus.term_string()],
  keyword()
) :: float()
```

Returns the BM25 score for the entire query against document `doc_id`.

### Arguments

* `corpus` is a `Text.IR.Corpus`.

* `doc_id` is the zero-based document index.

* `query` is either a string (which is tokenised through the
  corpus's tokenizer) or a list of pre-tokenised terms.

### Options

* `:k1` — saturation parameter, defaults to `1.2`. Higher values
  make repeated occurrences of the same term in a document weigh
  more heavily.

* `:b` — length-normalisation parameter, defaults to `0.75`.
  `0.0` disables length normalisation; `1.0` normalises fully.
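
To get a feel for the two parameters, the sketch below evaluates only the term-frequency factor of BM25, leaving out the IDF weight. `Bm25Params` is a hypothetical helper written for this illustration and is not part of `Text.IR`.

```elixir
defmodule Bm25Params do
  # The tf-dependent factor of BM25, without the IDF weight:
  # tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
  def tf_factor(tf, dl, avgdl, k1, b) do
    tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
  end
end

# Saturation: for a document of average length, going from 1 to 10
# occurrences raises the factor far less than tenfold, and a larger k1
# lets it rise further.
for k1 <- [1.2, 2.0] do
  one = Bm25Params.tf_factor(1, 5, 5.0, k1, 0.75)
  ten = Bm25Params.tf_factor(10, 5, 5.0, k1, 0.75)
  IO.inspect({k1, one, ten})
end

# Length normalisation: with b = 0.0 a document twice the average length
# is treated like an average one; with b = 1.0 it is penalised.
IO.inspect(Bm25Params.tf_factor(1, 10, 5.0, 1.2, 0.0))
IO.inspect(Bm25Params.tf_factor(1, 10, 5.0, 1.2, 1.0))
```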

### Returns

* A non-negative float. `0.0` if no query term appears in the
  document.

### Examples

    iex> docs = ["the cat sat on the mat", "the dog sat on the log", "the cat ran"]
    iex> corpus = Text.IR.Corpus.new(docs)
    iex> Text.IR.bm25(corpus, 0, "cat sat") > Text.IR.bm25(corpus, 1, "cat sat")
    true

# `search`

```elixir
@spec search(
  Text.IR.Corpus.t(),
  String.t() | [Text.IR.Corpus.term_string()],
  keyword()
) :: [
  {Text.IR.Corpus.doc_id(), float()}
]
```

Returns the top-K documents matching `query`, ranked by score.

### Arguments

* `corpus` is a `Text.IR.Corpus`.

* `query` is a string or pre-tokenised list of terms.

### Options

* `:scorer` — `:bm25` (default) or `:tfidf`.

* `:k` — number of results to return. Defaults to `10`.

* `:k1`, `:b` — passed through to BM25 when `scorer: :bm25`.

### Returns

* A list of `{doc_id, score}` pairs in descending score order. Documents
  with score `0.0` are excluded.
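
The return contract above (drop zero scores, sort descending, keep the first `:k`) can be sketched as a generic ranking step. `RankSketch.top_k/3` and the `scores` map are hypothetical and only illustrate the shape of the result, not how `search/3` is implemented.

```elixir
defmodule RankSketch do
  # Hypothetical helper: `score_fun` maps a doc_id to a float score.
  def top_k(doc_ids, score_fun, k) do
    doc_ids
    |> Enum.map(fn id -> {id, score_fun.(id)} end)
    # Documents that match no query term score 0.0 and are left out.
    |> Enum.reject(fn {_id, score} -> score == 0.0 end)
    |> Enum.sort_by(fn {_id, score} -> score end, :desc)
    |> Enum.take(k)
  end
end

# Made-up scores for three documents.
scores = %{0 => 0.47, 1 => 0.0, 2 => 0.61}
IO.inspect(RankSketch.top_k(0..2, &Map.fetch!(scores, &1), 2))
# [{2, 0.61}, {0, 0.47}]
```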

### Examples

    iex> docs = [
    ...>   "the cat sat on the mat",
    ...>   "the dog sat on the log",
    ...>   "elephants are large"
    ...> ]
    iex> corpus = Text.IR.Corpus.new(docs)
    iex> [{best_id, _score} | _] = Text.IR.search(corpus, "cat", k: 3)
    iex> best_id
    0

# `tfidf`

```elixir
@spec tfidf(Text.IR.Corpus.t(), Text.IR.Corpus.doc_id(), Text.IR.Corpus.term_string()) ::
  float()
```

Returns the TF-IDF score for `term` in document `doc_id` of the
given corpus.

### Arguments

* `corpus` is a `Text.IR.Corpus`.

* `doc_id` is the zero-based document index.

* `term` is the term string to score. The term is folded to match
  the corpus's case-folding setting before lookup.

### Returns

* A non-negative float. Returns `0.0` when the term doesn't appear
  in the document.

### Examples

    iex> docs = ["the cat sat", "the dog sat", "a fox ran"]
    iex> corpus = Text.IR.Corpus.new(docs)
    iex> Text.IR.tfidf(corpus, 0, "cat") > 0.0
    true

    iex> docs = ["the cat sat", "the dog sat", "a fox ran"]
    iex> corpus = Text.IR.Corpus.new(docs)
    iex> Text.IR.tfidf(corpus, 0, "missing")
    0.0

---

*Consult [api-reference.md](api-reference.md) for the complete listing.*
