# `Text.IR.Corpus`
[🔗](https://github.com/kipcole9/text/blob/v0.5.0/lib/ir/corpus.ex#L1)

An indexed corpus of documents for information-retrieval scoring.

Wraps a list of documents in the precomputed statistics that TF-IDF
and BM25 need: document frequencies, term frequencies, document
lengths, and average document length. Build once with `new/2`, then
query repeatedly via `Text.IR.tfidf/3`, `Text.IR.bm25/4`, or
`Text.IR.search/3`.

### Tokenisation

By default, documents are split into terms with `Text.Segment.words/1`
and case-folded. Pass `:tokenizer` to override (any function from
`String.t() -> [String.t()]`) and `fold_case: false` to disable
lowercasing.
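For example, a case-sensitive corpus indexed on plain whitespace tokens
could be built like this. A sketch using the `:tokenizer` and
`:fold_case` options documented for `new/2` below; the results shown
follow from the documented semantics:

```elixir
# Split on whitespace instead of Text.Segment.words/1 and keep case,
# so "The" and "the" index as distinct terms.
corpus =
  Text.IR.Corpus.new(
    ["The cat sat", "the cat ran"],
    tokenizer: &String.split/1,
    fold_case: false
  )

Map.get(corpus.document_frequencies, "The")
#=> 1
Map.get(corpus.document_frequencies, "the")
#=> 1
```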

### Document identifiers

Each document is referenced by its zero-based index in the input
list. The index is stable for the lifetime of the corpus struct.
Original document text is retained for downstream highlighting and
KWIC display.
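Because results reference documents by index, the original text can be
recovered from the `documents` field. A sketch, assuming
`Text.IR.search/3` returns `{doc_id, score}` tuples and takes an options
list as its third argument (neither shape is specified on this page):

```elixir
corpus = Text.IR.Corpus.new(["the cat sat", "the dog ran"])

# Assumed result shape: a list of {doc_id, score} tuples.
for {doc_id, score} <- Text.IR.search(corpus, "dog", []) do
  # Map the zero-based index back to the retained document text.
  {Map.fetch!(corpus.documents, doc_id), score}
end
```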

# `doc_id`

```elixir
@type doc_id() :: non_neg_integer()
```

Zero-based document index.

# `t`

```elixir
@type t() :: %Text.IR.Corpus{
  avg_doc_length: float(),
  doc_lengths: %{required(doc_id()) => non_neg_integer()},
  document_frequencies: %{required(term_string()) => pos_integer()},
  documents: %{required(doc_id()) => String.t()},
  fold_case: boolean(),
  n_docs: non_neg_integer(),
  term_frequencies: %{
    required(doc_id()) => %{required(term_string()) => pos_integer()}
  },
  tokenizer: (String.t() -> [String.t()])
}
```

# `term_string`

```elixir
@type term_string() :: String.t()
```

A term — typically a single word.

# `new`

```elixir
@spec new(
  [String.t()],
  keyword()
) :: t()
```

Builds an indexed corpus from a list of documents.

### Arguments

* `documents` is a list of `t:String.t/0` documents.

### Options

* `:tokenizer` — a one-arg function from `t:String.t/0` to `[t:String.t/0]`.
  Defaults to `&Text.Segment.words/1`.

* `:fold_case` — when `true` (default), terms are lowercased so the
  index is case-insensitive. Set `false` to preserve case.

### Returns

* A `t:t/0` struct.

### Examples

    iex> docs = ["the cat sat", "the dog sat", "the dog ran"]
    iex> corpus = Text.IR.Corpus.new(docs)
    iex> corpus.n_docs
    3
    iex> corpus.avg_doc_length
    3.0
    iex> Map.get(corpus.document_frequencies, "the")
    3
    iex> Map.get(corpus.document_frequencies, "ran")
    1

# `tokenize_query`

```elixir
@spec tokenize_query(t(), String.t()) :: [term_string()]
```

Returns the corpus's view of a query: its tokens after the same
pre-processing that was applied at index time.

Useful when assembling a query vector for `Text.IR.bm25/4` or any
scoring function that needs the same tokenisation as the corpus.
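For instance, a query can be tokenized once and handed to a scorer. A
sketch only: the argument order of `Text.IR.bm25/4` and the `k1`/`b`
option names (the conventional BM25 parameters) are assumptions, not
taken from this page:

```elixir
corpus = Text.IR.Corpus.new(["the cat sat", "the dog sat"])

# Reuse the corpus's tokenizer and case-folding on the raw query.
terms = Text.IR.Corpus.tokenize_query(corpus, "The DOG")
#=> ["the", "dog"]

# Assumed shape: score one document (by doc_id) against the
# tokenized query, with BM25 parameters passed as options.
Text.IR.bm25(corpus, terms, 1, k1: 1.2, b: 0.75)
```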

### Examples

    iex> corpus = Text.IR.Corpus.new(["one two three"])
    iex> Text.IR.Corpus.tokenize_query(corpus, "TWO three!")
    ["two", "three"]

---

*Consult [api-reference.md](api-reference.md) for the complete listing.*
