Text.IR.Corpus (Text v0.5.0)


An indexed corpus of documents for information-retrieval scoring.

Combines a list of documents with the precomputed statistics that TF-IDF and BM25 need: document frequencies, term frequencies, document lengths, and average document length. Build once with new/2, then query repeatedly via Text.IR.tfidf/3, Text.IR.bm25/4, or Text.IR.search/3.

Tokenisation

By default, documents are split into terms with Text.Segment.words/1 and case-folded. Pass the :tokenizer option to override this (any function from String.t() to [String.t()]), and set :fold_case to false to disable lowercasing.
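As a sketch of the options above (assuming they behave as documented for new/2), a corpus can be built with a custom tokenizer and case folding disabled:

```elixir
# Whitespace-only splitting, preserving case.
# :tokenizer and :fold_case are the documented options of Text.IR.Corpus.new/2.
corpus =
  Text.IR.Corpus.new(
    ["Hello World", "hello again"],
    tokenizer: &String.split/1,
    fold_case: false
  )

# With fold_case: false, "Hello" and "hello" are indexed as distinct terms.
```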

Document identifiers

Each document is referenced by its zero-based index in the input list. The index is stable for the lifetime of the corpus struct. Original document text is retained for downstream highlighting and KWIC display.
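For example (a sketch using only fields shown in the t/0 type below), the original text can be recovered by doc_id:

```elixir
corpus = Text.IR.Corpus.new(["first document", "second document"])

# doc_ids are 0-based positions in the input list;
# corpus.documents maps each doc_id to the retained original text.
Map.get(corpus.documents, 0)
```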

Summary

Types

doc_id()

Zero-based document index.

t()

term_string()

A term — typically a single word.

Functions

new(documents, options \\ [])

Builds an indexed corpus from a list of documents.

tokenize_query(corpus, query)

Returns the corpus's view of a query — tokens after the same pre-processing applied at index time.

Types

doc_id()

@type doc_id() :: non_neg_integer()

Zero-based document index.

t()

@type t() :: %Text.IR.Corpus{
  avg_doc_length: float(),
  doc_lengths: %{required(doc_id()) => non_neg_integer()},
  document_frequencies: %{required(term_string()) => pos_integer()},
  documents: %{required(doc_id()) => String.t()},
  fold_case: boolean(),
  n_docs: non_neg_integer(),
  term_frequencies: %{
    required(doc_id()) => %{required(term_string()) => pos_integer()}
  },
  tokenizer: (String.t() -> [String.t()])
}

term_string()

@type term_string() :: String.t()

A term — typically a single word.

Functions

new(documents, options \\ [])

@spec new(
  [String.t()],
  keyword()
) :: t()

Builds an indexed corpus from a list of documents.

Arguments

  • documents — a list of document strings; each document's doc_id is its zero-based position in this list.

  • options — a keyword list of options (see below).

Options

  • :tokenizer — a one-arg function from String.t/0 to a list of String.t/0. Defaults to &Text.Segment.words/1.

  • :fold_case — when true (default), terms are lowercased so the index is case-insensitive. Set false to preserve case.

Returns

A Text.IR.Corpus.t/0 struct with the precomputed index statistics.

Examples

iex> docs = ["the cat sat", "the dog sat", "the dog ran"]
iex> corpus = Text.IR.Corpus.new(docs)
iex> corpus.n_docs
3
iex> corpus.avg_doc_length
3.0
iex> Map.get(corpus.document_frequencies, "the")
3
iex> Map.get(corpus.document_frequencies, "ran")
1
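The per-document statistics named in the t/0 type can be inspected the same way; a sketch assuming the default word tokenizer:

```elixir
corpus = Text.IR.Corpus.new(["the cat sat", "the cat sat on the mat"])

# doc_lengths holds the token count of each document.
corpus.doc_lengths

# term_frequencies nests per-document term counts,
# keyed first by doc_id, then by term.
get_in(corpus.term_frequencies, [1, "the"])
```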

tokenize_query(corpus, query)

@spec tokenize_query(t(), String.t()) :: [term_string()]

Returns the corpus's view of a query — tokens after the same pre-processing applied at index time.

Useful when assembling a query vector for Text.IR.bm25/4 or any scoring function that needs the same tokenisation as the corpus.

Examples

iex> corpus = Text.IR.Corpus.new(["one two three"])
iex> Text.IR.Corpus.tokenize_query(corpus, "TWO three!")
["two", "three"]