Text.IR.Corpus (Text v0.5.0)


An indexed corpus of documents for information-retrieval scoring.

Combines a list of documents with the precomputed statistics that TF-IDF and BM25 need: document frequencies, term frequencies, document lengths, and average document length. Build once with new/2, then query repeatedly via Text.IR.tfidf/3, Text.IR.bm25/4, or Text.IR.search/3.

Tokenisation

By default, documents are split into terms with Text.Segment.words/1 and case-folded. Pass the :tokenizer option to override this (any function from String.t() to [String.t()]), and set :fold_case to false to disable lowercasing.
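As a sketch of the options above (assuming they behave as documented for new/2), a corpus can be built with a custom tokenizer and case folding disabled:

```elixir
# Whitespace-only splitting, preserving case.
# :tokenizer and :fold_case are the documented options of Text.IR.Corpus.new/2.
corpus =
  Text.IR.Corpus.new(
    ["Hello World", "hello again"],
    tokenizer: &String.split/1,
    fold_case: false
  )

# With fold_case: false, "Hello" and "hello" are indexed as distinct terms.
```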

Document identifiers

Each document is referenced by its zero-based index in the input list. The index is stable for the lifetime of the corpus struct. Original document text is retained for downstream highlighting and KWIC display.
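For example (a sketch using only fields shown in the t/0 type below), the original text can be recovered by doc_id:

```elixir
corpus = Text.IR.Corpus.new(["first document", "second document"])

# doc_ids are 0-based positions in the input list;
# corpus.documents maps each doc_id to the retained original text.
Map.get(corpus.documents, 0)
```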

Summary

Types

doc_id()

Zero-based document index.

t()

term_string()

A term — typically a single word.

Functions

new(documents, options \\ [])

Builds an indexed corpus from a list of documents.

tokenize_query(corpus, query)

Returns the corpus's view of a query — tokens after the same pre-processing applied at index time.

Types

doc_id()

@type doc_id() :: non_neg_integer()

Zero-based document index.

t()

@type t() :: %Text.IR.Corpus{
  avg_doc_length: float(),
  doc_lengths: %{required(doc_id()) => non_neg_integer()},
  document_frequencies: %{required(term_string()) => pos_integer()},
  documents: %{required(doc_id()) => String.t()},
  fold_case: boolean(),
  n_docs: non_neg_integer(),
  term_frequencies: %{
    required(doc_id()) => %{required(term_string()) => pos_integer()}
  },
  tokenizer: (String.t() -> [String.t()])
}

term_string()

@type term_string() :: String.t()

A term — typically a single word.

Functions

new(documents, options \\ [])

@spec new(
  [String.t()],
  keyword()
) :: t()

Builds an indexed corpus from a list of documents.

Arguments

  • documents — a list of document strings; each document's doc_id is its zero-based position in this list.

  • options — a keyword list of options (see below).

Options

  • :tokenizer — a one-arg function from String.t/0 to a list of String.t/0. Defaults to &Text.Segment.words/1.

  • :fold_case — when true (default), terms are lowercased so the index is case-insensitive. Set false to preserve case.

Returns

A Text.IR.Corpus.t/0 struct with the precomputed index statistics.

Examples

iex> docs = ["the cat sat", "the dog sat", "the dog ran"]
iex> corpus = Text.IR.Corpus.new(docs)
iex> corpus.n_docs
3
iex> corpus.avg_doc_length
3.0
iex> Map.get(corpus.document_frequencies, "the")
3
iex> Map.get(corpus.document_frequencies, "ran")
1
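The per-document statistics named in the t/0 type can be inspected the same way; a sketch assuming the default word tokenizer:

```elixir
corpus = Text.IR.Corpus.new(["the cat sat", "the cat sat on the mat"])

# doc_lengths holds the token count of each document.
corpus.doc_lengths

# term_frequencies nests per-document term counts,
# keyed first by doc_id, then by term.
get_in(corpus.term_frequencies, [1, "the"])
```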

tokenize_query(corpus, query)

@spec tokenize_query(t(), String.t()) :: [term_string()]

Returns the corpus's view of a query — tokens after the same pre-processing applied at index time.

Useful when assembling a query vector for Text.IR.bm25/4 or any scoring function that needs the same tokenisation as the corpus.

Examples

iex> corpus = Text.IR.Corpus.new(["one two three"])
iex> Text.IR.Corpus.tokenize_query(corpus, "TWO three!")
["two", "three"]