Text.KWIC finds every occurrence of a term in a text and returns each match with its surrounding context — the classic linguistic concordance view that lets you see at a glance how a word is used across a document, not just whether it appears.
This is the tool to reach for when you've found that some word matters (via Text.WordCloud, a frequency count, or a search hit) and you now want to actually look at every place it shows up. A frequency cloud tells you Earth appears six times in the corpus; a concordance tells you the Earth gets demolished, that it was mostly harmless, that the dolphins knew it was coming, and that Trillian is from there.
Quick start
text = """
The Earth was demolished to make way for a hyperspace bypass.
Trillian, formerly known as Tricia McMillan of Earth, is a brilliant
astrophysicist. The mice ran an experiment on the Earth. The dolphins
left Earth knowing what was coming. Mostly harmless: that's what the
revised Guide entry on Earth said.
"""
Text.KWIC.concordance(text, "Earth", context: 4)
#=> [
#=> %Text.KWIC.Match{
#=> position: 1,
#=> left: ["The"],
#=> term: "Earth",
#=> right: ["was", "demolished", "to", "make"]
#=> },
#=> %Text.KWIC.Match{
#=> position: 18,
#=> left: ["as", "Tricia", "McMillan", "of"],
#=> term: "Earth",
#=> right: ["is", "a", "brilliant", "astrophysicist"]
#=> },
#=> ...
#=> ]

Each Text.KWIC.Match carries the full context: the matched token in its original casing, the tokens to its left and right, and its position in the document.
To pretty-print:
text
|> Text.KWIC.concordance("Earth", context: 4)
|> Enum.map(&Text.KWIC.format(&1, width: 30))
|> Enum.each(&IO.puts/1)
#=>                            The | Earth | was demolished to make
#=>          as Tricia McMillan of | Earth | is a brilliant astrophysicist
#=>           an experiment on the | Earth | The dolphins left Earth
#=>        Earth The dolphins left | Earth | knowing what was coming
#=>         revised Guide entry on | Earth | said

The :width option pads the left context to a fixed width so the term column aligns cleanly across rows — the standard concordance presentation. Suddenly the uses of "Earth" pop out: it's an object of demolition, a place of origin, a subject of experiments, a thing being left, and a Guide entry.
Match struct
%Text.KWIC.Match{
position: 1, # zero-based token index in the input
left: ["The"], # tokens preceding the match
term: "Earth", # the matched token, in its ORIGINAL casing
right: ["was", "demolished", "to", "make"]
}

:position is useful for sorting, deduplicating, or correlating matches across different searches. It's the index in the tokenizer's output, not the byte offset — convert to a byte offset only if you also store the original input and re-tokenize.

:term preserves original case even when the search itself is case-insensitive. So a search for "earth" returns Earth/earth/EARTH matches with each surface form intact in the result.
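The mechanics behind this struct are easy to picture. Below is a minimal, self-contained sketch of how such matches could be produced, not Text.KWIC's actual implementation; a naive regex split stands in for the real UAX #29 tokenizer, and plain maps stand in for the struct:

```elixir
defmodule MiniKWIC do
  # Illustrative only: naive tokenizer plus a case-insensitive scan.
  def concordance(text, term, context \\ 5) do
    tokens = String.split(text, ~r/[^\p{L}\p{N}']+/u, trim: true)
    folded = String.downcase(term)

    tokens
    |> Enum.with_index()
    |> Enum.filter(fn {tok, _i} -> String.downcase(tok) == folded end)
    |> Enum.map(fn {tok, i} ->
      %{
        position: i,
        left: Enum.slice(tokens, max(i - context, 0), min(i, context)),
        # the matched token keeps its original casing
        term: tok,
        right: Enum.slice(tokens, i + 1, context)
      }
    end)
  end
end

MiniKWIC.concordance("The Earth was demolished", "earth", 2)
#=> [%{left: ["The"], position: 1, right: ["was", "demolished"], term: "Earth"}]
```

The `min(i, context)` clamp keeps the left slice from wrapping when the match sits near the start of the token stream.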
Context width
The :context option controls how many tokens of left/right context come back. Default 5. Choose a value that fits your screen and your purpose:
Text.KWIC.concordance(text, "towel", context: 3)
# Tight, easy to scan in a list:
#=> [%Match{left: ["was", "given", "a"], term: "towel", right: ["about", "the", "most"]}]
Text.KWIC.concordance(text, "towel", context: 12)
# Wider, gives the full sentence-level context:
#=> [%Match{left: [..., "Arthur", "was", "given", "a"], term: "towel", right: [...]}]

Two common rules of thumb:
For display (concordance lines printed to a terminal or web view): match the context to your row width. ~5–8 tokens fits comfortably on an 80-column terminal at default sizes.
For collocation analysis (programmatic — looking at what words consistently appear near the target): use a wider window like 10–20 to capture full-sentence relationships, then aggregate.
Case sensitivity
Searches are case-insensitive by default — "earth" matches Earth and earth both. Pass case_sensitive: true to require an exact match:
Text.KWIC.concordance(text, "Earth", case_sensitive: true)
# Only the original-casing "Earth" tokens.
Text.KWIC.concordance(text, "earth", case_sensitive: true)
# Empty if the corpus only uses "Earth" (capitalised).

For literary or scholarly use, case-sensitive search is sometimes important: distinguishing God from god, Paris (the place) from paris (the plaster), Plant from plant. For most engineering-style "find all uses" workflows, the case-insensitive default is what you want.
Custom tokenizers
concordance/3 defaults to Text.Segment.words/1 — Unicode UAX #29 word segmentation, locale-naive, punctuation-stripped. That's right for most text. Override with :tokenizer if you need different splitting:
# Keep punctuation as separate tokens (useful for code, structured text):
Text.KWIC.concordance(text, "earth",
tokenizer: &Text.Segment.words(&1, punctuation: :keep)
)
# Locale-aware word breaks (CJK, Thai, etc.):
Text.KWIC.concordance(thai_text, "ประเทศ",
tokenizer: &Text.Segment.words(&1, locale: "th")
)
# Custom tokenization (e.g. character-level for languages without word breaks):
Text.KWIC.concordance(text, "侃",
tokenizer: &String.graphemes/1
)

The tokenizer is a single-arity function String.t() -> [String.t()]. Whatever it returns is what KWIC searches against; the :term you pass is matched against tokens as-is (after case folding, if requested).
Formatting
Text.KWIC.format/2 converts a single Match to a printable string:
match = hd(Text.KWIC.concordance(text, "demolished", context: 4))
Text.KWIC.format(match)
#=> "Earth was | demolished | to make way for a"
Text.KWIC.format(match, separator: " >>> ")
#=> "Earth was >>> demolished >>> to make way for a"
Text.KWIC.format(match, width: 25)
#=> " Earth was | demolished | to make way for a"The :width padding aligns the right edge of the left-context column. Pick a value at least as wide as your widest expected left-context to keep the term column fixed in position across rows — this is the visual cue that makes a concordance table actually scannable.
Patterns
Collocate scan
What words consistently appear near a target term?
collocates =
text
|> Text.KWIC.concordance("Vogon", context: 5)
|> Enum.flat_map(fn m -> m.left ++ m.right end)
|> Enum.map(&String.downcase/1)
|> Enum.frequencies()
|> Enum.sort_by(fn {_w, n} -> -n end)
|> Enum.take(10)

For larger corpora reach for Text.Collocation.bigrams/2 instead — KWIC is meant for human-readable inspection of a handful of matches, not bulk co-occurrence statistics.
Sense disambiguation
When a word has multiple senses, the contexts cluster:
Text.KWIC.concordance(corpus, "fish", context: 6)
# Some matches will be about the babel fish (a creature);
# others about "thanks for all the fish" (the dolphins' farewell).
# Visually obvious in the concordance, hard to spot in a frequency view.

Finding all forms
KWIC searches for an exact token. Inflected variants need separate calls or a stem-bucket pre-pass:
~w[demolish demolished demolishing demolition]
|> Enum.flat_map(&Text.KWIC.concordance(text, &1, context: 4))
|> Enum.sort_by(& &1.position)

For a more general "all morphological variants" search, run Text.WordCloud.terms/2 with stem: true to find the surface forms present in the corpus, then loop them through KWIC.
Cross-document concordance
Each concordance/3 call works on one input. To search across many documents, loop and tag the results:
documents = %{
"h2g2" => h2g2_text,
"rest_at_eou" => rest_at_eou_text,
"life_universe" => life_universe_text
}
documents
|> Enum.flat_map(fn {doc_id, text} ->
text
|> Text.KWIC.concordance("Vogon", context: 5)
|> Enum.map(&{doc_id, &1})
end)
|> Enum.each(fn {doc_id, m} ->
IO.puts("#{doc_id}\t#{Text.KWIC.format(m, width: 30)}")
end)

Adds a document column to the concordance — handy for tracing how a term's usage drifts across a series.
Limitations
Single-token search only. Multi-word phrases ("hyperspace bypass") need a separate pass — tokenize the corpus, find positions where the phrase tokens align consecutively, and build the match manually. A future revision may add native phrase support; for now it's outside scope.

No regex / wildcard. The term match is exact equality against the (folded) token. For pattern matching, pre-process the corpus to extract the matching tokens, then concordance each one.
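That manual phrase pass can be sketched as a sliding window over the token stream. A hedged example, with a naive regex split standing in for the real tokenizer:

```elixir
# Tokenize, then find windows where the phrase tokens align consecutively.
text = "The Earth was demolished to make way for a hyperspace bypass."
tokens = String.split(text, ~r/[^\p{L}\p{N}']+/u, trim: true)
phrase = ~w[hyperspace bypass]
n = length(phrase)
folded_phrase = Enum.map(phrase, &String.downcase/1)

phrase_positions =
  tokens
  |> Enum.map(&String.downcase/1)
  |> Enum.chunk_every(n, 1, :discard)   # all n-token windows, step 1
  |> Enum.with_index()
  |> Enum.filter(fn {window, _i} -> window == folded_phrase end)
  |> Enum.map(fn {_window, i} -> i end)

phrase_positions
#=> [9]
```

From each starting position i, left context is the tokens before i and right context the tokens after i + n - 1, built with Enum.slice exactly as for single-token matches.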
Whole-document tokenization. concordance/3 tokenizes the entire input on every call. For very large corpora that you'll be searching many times, pre-tokenize once and pass the token list directly via a custom :tokenizer that just returns the cached list:

cached_tokens = Text.Segment.words(huge_text)
noop_tokenizer = fn _text -> cached_tokens end
Text.KWIC.concordance(huge_text, "earth", tokenizer: noop_tokenizer)

Position is token-indexed, not byte-indexed. Cross-referencing with the original input requires re-tokenization or carrying a position-to-offset map alongside the matches.
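When you do need byte offsets, a position-to-offset map can be built in one scan with Regex.scan's return: :index mode. A sketch, assuming a naive token pattern in place of the real tokenizer:

```elixir
# Build a token-index -> byte-offset map in one pass over the input.
text = "The Earth was demolished."
token_pattern = ~r/[\p{L}\p{N}']+/u

offsets =
  Regex.scan(token_pattern, text, return: :index)
  |> Enum.map(fn [{start, _len}] -> start end)   # byte offset of each token
  |> Enum.with_index()
  |> Map.new(fn {offset, i} -> {i, offset} end)

offsets
#=> %{0 => 0, 1 => 4, 2 => 10, 3 => 14}
```

A match's position then converts straight to a byte offset via offsets[match.position], as long as the pattern used here splits tokens the same way as the tokenizer used for searching.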
When to reach for KWIC
| Question | Tool |
|---|---|
| "Does this word appear in the document?" | String.contains?/2 |
| "Where does this word appear?" | String.split/2 then walk |
| "How frequently does this word appear?" | Text.WordCloud.terms with :frequency scoring |
| "What does this word mean here — how is it used?" | Text.KWIC.concordance |
| "What words are statistically associated with this one?" | Text.Collocation.bigrams |
| "Highlight every occurrence in a UI" | Text.KWIC.concordance + render position and term slices |
Concordances are a literary-scholarship and corpus-linguistics tool by lineage — but they're equally useful in any context where you've zeroed in on an interesting word and now need to look at the actual sentences it lives in. Don't panic.