Text.KWIC (Text v0.5.0)


Keyword-In-Context concordance.

Given a piece of text and a search term, returns each occurrence of the term with a window of surrounding tokens on either side. This is the classic concordancing tool from corpus linguistics, useful for inspecting how a word is actually used in a corpus, building glossaries, and debugging tokenisation.

Example display rendering:

"the quick brown" | "fox" | "jumped over the"
"the lazy red"    | "fox" | "ran past the"

Each match is returned as a Text.KWIC.Match struct carrying the pre-context, the matched token (in its original casing), and the post-context. Use format/2 to turn a match into a readable string with the term centred and visually delimited.
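The windowing idea described above can be sketched in a few lines of plain Elixir. This is only an illustration under stated assumptions, not the library's implementation: the module name KWICSketch is hypothetical, tokenisation is a naive whitespace split, matches are plain maps rather than Text.KWIC.Match structs, and matching is always case-insensitive.

```elixir
defmodule KWICSketch do
  # Hypothetical stand-in for illustration only; not how Text.KWIC is implemented.
  def concordance(text, term, context \\ 5) do
    # Assumption: split on whitespace instead of using a real word segmenter.
    tokens = String.split(text)
    target = String.downcase(term)

    tokens
    |> Enum.with_index()
    |> Enum.filter(fn {token, _index} -> String.downcase(token) == target end)
    |> Enum.map(fn {token, index} ->
      # Clamp the window start so it never runs past the beginning of the text.
      start = max(index - context, 0)

      %{
        position: index,
        left: Enum.slice(tokens, start, index - start),
        term: token,
        right: Enum.slice(tokens, index + 1, context)
      }
    end)
  end
end

[match] = KWICSketch.concordance("the quick brown fox jumped over the lazy dog", "fox", 3)
IO.inspect(match.left)   # ["the", "quick", "brown"]
IO.inspect(match.right)  # ["jumped", "over", "the"]
```

Near the start or end of the text the window is simply shorter, which is why a match at token 1 can have a single token of left context even when more were requested.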

Summary

Functions

concordance(text, term, options \\ [])

Returns every occurrence of term in text with surrounding context.

format(match, options \\ [])

Renders a Match as a readable concordance line.

Functions

concordance(text, term, options \\ [])

@spec concordance(String.t(), String.t(), keyword()) :: [Text.KWIC.Match.t()]

Returns every occurrence of term in text with surrounding context.

Arguments

  • text is a UTF-8 string.

  • term is the search term — a single token (e.g. "cat"). Multi-word phrases are not yet supported.

Options

  • :context — number of tokens of context on each side. Defaults to 5.

  • :case_sensitive — when false (default), the search is case-insensitive. The output preserves original casing regardless.

  • :tokenizer — a string-to-tokens function. Defaults to &Text.Segment.words/1.

Returns

  • A list of Text.KWIC.Match structs in document order. Returns [] if the term is not found.

Examples

iex> matches = Text.KWIC.concordance("the cat sat on the mat", "cat", context: 2)
iex> match = hd(matches)
iex> match.term
"cat"
iex> match.left
["the"]
iex> match.right
["sat", "on"]
iex> match.position
1

iex> Text.KWIC.concordance("no matches here", "missing")
[]
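To make the :tokenizer and :case_sensitive options more concrete, here is a small self-contained sketch of what a custom string-to-tokens function and the two matching modes might look like. The module KWICOptionsSketch and both of its functions are hypothetical, and the comparison rule is an assumption about how case-insensitive matching could work, not a description of the library's internals.

```elixir
defmodule KWICOptionsSketch do
  # Hypothetical custom tokeniser: keeps runs of letters, digits and apostrophes,
  # so punctuation such as "." or "!" never ends up attached to a token.
  def punctuation_free_tokens(text) do
    ~r/[\p{L}\p{N}']+/u
    |> Regex.scan(text)
    |> List.flatten()
  end

  # Assumed matching rule: compare downcased forms unless case sensitivity is
  # requested. The token itself is never modified, so original casing survives.
  def matches?(token, term, case_sensitive) do
    if case_sensitive do
      token == term
    else
      String.downcase(token) == String.downcase(term)
    end
  end
end

tokens = KWICOptionsSketch.punctuation_free_tokens("The cat sat. A Cat slept!")
IO.inspect(tokens)
# ["The", "cat", "sat", "A", "Cat", "slept"]

IO.inspect(Enum.filter(tokens, &KWICOptionsSketch.matches?(&1, "cat", false)))
# ["cat", "Cat"]

IO.inspect(Enum.filter(tokens, &KWICOptionsSketch.matches?(&1, "cat", true)))
# ["cat"]
```

Since :tokenizer is documented as a string-to-tokens function, a one-argument function of this shape is presumably what the option expects, for example tokenizer: &KWICOptionsSketch.punctuation_free_tokens/1.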

format(match, options \\ [])

@spec format(
  Text.KWIC.Match.t(),
  keyword()
) :: String.t()

Renders a Match as a readable concordance line.

Arguments

  • match is a Text.KWIC.Match struct, such as one returned by concordance/3.

Options

  • :separator — string placed between the three sections. Defaults to " | ".

  • :width — when set, pads the left context to this many characters so multiple lines align in a fixed-width display.

Returns

  • A string.

Examples

iex> match = %Text.KWIC.Match{
...>   position: 1, left: ["the"], term: "cat", right: ["sat", "on"]
...> }
iex> Text.KWIC.format(match)
"the | cat | sat on"

iex> match = %Text.KWIC.Match{
...>   position: 1, left: ["the"], term: "cat", right: ["sat", "on"]
...> }
iex> Text.KWIC.format(match, separator: " ~ ")
"the ~ cat ~ sat on"
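The effect of :width on a multi-line display can be illustrated with a small stand-alone sketch. KWICFormatSketch is a hypothetical re-implementation working on plain maps, and it assumes the left context is right-aligned (padded on its left-hand side with String.pad_leading/2) so that the term column lines up; the actual padding direction used by format/2 is not stated above.

```elixir
defmodule KWICFormatSketch do
  # Hypothetical re-implementation for illustration; Text.KWIC.format/2 may differ.
  def format(match, separator \\ " | ", width \\ nil) do
    left = Enum.join(match.left, " ")

    # Assumption: pad on the left so the term column starts at the same offset.
    left = if width, do: String.pad_leading(left, width), else: left

    Enum.join([left, match.term, Enum.join(match.right, " ")], separator)
  end
end

matches = [
  %{left: ["the", "quick", "brown"], term: "fox", right: ["jumped", "over", "the"]},
  %{left: ["a", "red"], term: "fox", right: ["ran", "past"]}
]

# Prints both lines with "fox" starting in the same column.
for match <- matches do
  IO.puts(KWICFormatSketch.format(match, " | ", 15))
end
```

Choosing a width at least as long as the longest joined left context keeps every line aligned; shorter contexts are padded and longer ones are left untouched by String.pad_leading/2.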