An indexed corpus of documents for information-retrieval scoring.
Wraps a list of documents in the precomputed statistics that TF-IDF
and BM25 need: document frequencies, term frequencies, document
lengths, and average document length. Build once with new/2, then
query repeatedly via Text.IR.tfidf/3, Text.IR.bm25/4, or
Text.IR.search/3.
Tokenisation
By default, documents are split into terms with Text.Segment.words/1
and case-folded. Pass :tokenizer to override (any function from
String.t() -> [String.t()]) and :fold_case to disable lowercasing.
Document identifiers
Each document is referenced by its zero-based index in the input list. The index is stable for the lifetime of the corpus struct. Original document text is retained for downstream highlighting and KWIC display.
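A quick sketch of the identifier scheme (assuming whitespace tokenisation of these two inputs):

```elixir
iex> corpus = Text.IR.Corpus.new(["alpha beta", "gamma"])
iex> corpus.documents
%{0 => "alpha beta", 1 => "gamma"}
iex> corpus.doc_lengths
%{0 => 2, 1 => 1}
```

The doc_id 0 and 1 keys mirror the positions in the input list, so any score returned for a document can be mapped straight back to its original text.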
Summary
Functions
Builds an indexed corpus from a list of documents.
Returns the corpus's view of a query — tokens after the same pre-processing applied at index time.
Types
@type doc_id() :: non_neg_integer()
Zero-based document index.
@type t() :: %Text.IR.Corpus{
  avg_doc_length: float(),
  doc_lengths: %{required(doc_id()) => non_neg_integer()},
  document_frequencies: %{required(term_string()) => pos_integer()},
  documents: %{required(doc_id()) => String.t()},
  fold_case: boolean(),
  n_docs: non_neg_integer(),
  term_frequencies: %{
    required(doc_id()) => %{required(term_string()) => pos_integer()}
  },
  tokenizer: (String.t() -> [String.t()])
}
@type term_string() :: String.t()
A term — typically a single word.
Functions
Builds an indexed corpus from a list of documents.
Arguments
documents is a list of String.t/0 documents.
Options
:tokenizer — a one-arg function from String.t/0 to [String.t/0]. Defaults to &Text.Segment.words/1.
:fold_case — when true (the default), terms are lowercased so the index is case-insensitive. Set false to preserve case.
Returns
A t/0 struct.
Examples
iex> docs = ["the cat sat", "the dog sat", "the dog ran"]
iex> corpus = Text.IR.Corpus.new(docs)
iex> corpus.n_docs
3
iex> corpus.avg_doc_length
3.0
iex> Map.get(corpus.document_frequencies, "the")
3
iex> Map.get(corpus.document_frequencies, "ran")
1
@spec tokenize_query(t(), String.t()) :: [term_string()]
Returns the corpus's view of a query — the query's tokens after the same pre-processing that was applied at index time.
Useful when assembling a query vector for Text.IR.bm25/4 or any
scoring function that needs the same tokenisation as the corpus.
Examples
iex> corpus = Text.IR.Corpus.new(["one two three"])
iex> Text.IR.Corpus.tokenize_query(corpus, "TWO three!")
["two", "three"]