# textmetrics
String comparison metrics for Gleam: edit distances, similarity scores, longest common subsequence, and line-level diff.
## Features
- Edit distances: Levenshtein, Damerau-Levenshtein (true variant), OSA, Hamming
- Similarity scores in [0.0, 1.0]: Jaro, Jaro-Winkler, Sørensen-Dice
- Longest common subsequence (length and a recovered sequence)
- Diff: Myers (1986) and patience, plus POSIX unified-diff rendering
- `did_you_mean` and Jaro-Winkler ranking for spell-correction style search
- Pure Gleam, runs on Erlang and JavaScript targets
- Operates on Unicode grapheme clusters (UAX #29)
## Install

```sh
gleam add textmetrics
```
## Suggesting a similar command name
When a user mistypes a subcommand or flag, fall back to a Levenshtein-bounded
search to suggest the closest known names. Ties are broken by the order they
appear in `candidates`, which keeps suggestions deterministic.
```gleam
import textmetrics/search

pub fn suggest_command(typed: String) -> List(String) {
  let known = ["install", "uninstall", "remove", "update", "help"]
  search.did_you_mean(typed, known, 2)
}

// suggest_command("instal") -> ["install"]
// suggest_command("updat") -> ["update"]
// suggest_command("xyz") -> []
```
`search.closest` is the convenience form when only one suggestion is
needed (the typical “Did you mean X?” path).
```gleam
import textmetrics/search

pub fn one_suggestion(typed: String) -> Result(String, Nil) {
  let known = ["install", "uninstall", "remove", "update", "help"]
  search.closest(typed, known, 2)
}

// one_suggestion("instal") -> Ok("install")
// one_suggestion("xyz") -> Error(Nil)
```
## Comparing strings: Levenshtein, Damerau-Levenshtein, OSA
`levenshtein` counts the minimum number of insert / delete / substitute
operations. `damerau_levenshtein` adds adjacent-grapheme transposition and
lets the same substring participate in multiple edits. `osa` uses the same
operations but allows each substring to be edited at most once; that is
what most “Damerau distance” libraries actually compute.
```gleam
import textmetrics/distance

pub fn distances() -> #(Int, Int, Int) {
  let l = distance.levenshtein("CA", "ABC")
  let dl = distance.damerau_levenshtein("CA", "ABC")
  let o = distance.osa("CA", "ABC")
  #(l, dl, o)
}

// distances() -> #(3, 2, 3)
```
`distance.normalized_levenshtein` rescales Levenshtein distance to a
similarity in [0.0, 1.0] (1.0 means identical). Use it when ranking
by edit distance is preferred over Jaro-Winkler.
```gleam
import textmetrics/distance

pub fn levenshtein_similarity() -> Float {
  // levenshtein = 3, max graphemes = 7 → 1 - 3/7 = 4/7.
  distance.normalized_levenshtein("kitten", "sitting")
}
```
`hamming` requires equal-length inputs and returns
`Error(LengthMismatch(...))` otherwise, so callers do not have to
pre-validate the input lengths.
```gleam
import textmetrics/distance

pub fn hamming_check(
  a: String,
  b: String,
) -> Result(Int, distance.HammingError) {
  distance.hamming(a, b)
}

// hamming_check("karolin", "kathrin") -> Ok(3)
// hamming_check("ab", "abc") -> Error(LengthMismatch(left: 2, right: 3))
```
## Scoring similarity: Jaro, Jaro-Winkler, Sørensen-Dice
`jaro_winkler` boosts scores for strings that share a common prefix, which
suits typo tolerance in human names. `sorensen_dice` compares the multiset
of grapheme n-grams.
```gleam
import textmetrics/similarity

pub fn jaro_score() -> Float {
  similarity.jaro("MARTHA", "MARHTA")
}

pub fn jaro_winkler_score() -> Float {
  similarity.jaro_winkler("MARTHA", "MARHTA")
}

pub fn dice_score() -> Result(Float, similarity.SorensenDiceError) {
  similarity.sorensen_dice("night", "nacht", 2)
}

// jaro_score() -> 0.944_444…
// jaro_winkler_score() -> 0.961_111…
// dice_score() -> Ok(0.25)
```
`jaro_winkler_with` accepts a validated `JaroWinklerConfig` for non-default
prefix scaling. The smart constructor enforces `prefix_scale` ∈ [0.0, 0.25]
and `prefix_max >= 0`.
```gleam
import gleam/result
import textmetrics/similarity

pub fn aggressive_winkler() -> Result(
  Float,
  similarity.JaroWinklerConfigError,
) {
  use cfg <- result.map(similarity.jaro_winkler_config(
    prefix_scale: 0.2,
    prefix_max: 6,
  ))
  similarity.jaro_winkler_with("MARTHA", "MARHTA", cfg)
}
```
## Producing an edit script and a unified diff
`diff.myers` returns an `EditScript`, a list of `Equal | Insert | Delete`
steps that round-trips both inputs. `to_unified` renders that script in
the format `diff -u` produces.
```gleam
import textmetrics/diff

pub fn render_diff() -> String {
  let old = ["the quick brown", "fox jumps over", "the lazy dog"]
  let new = ["the quick brown", "fox leaps over", "the lazy dog"]
  let opts = diff.unified_options(old_name: "a", new_name: "b")
  diff.to_unified(diff.myers(old, new), opts)
}

// render_diff() ->
// --- a
// +++ b
// @@ -1,3 +1,3 @@
//  the quick brown
// -fox jumps over
// +fox leaps over
//  the lazy dog
```
`recover_old` and `recover_new` rebuild the inputs from a script. The
round-trip property is the principal invariant of the `Edit` ADT.
```gleam
import textmetrics/diff
import textmetrics/edit

pub fn round_trip_holds() -> Bool {
  let old = ["a", "b", "c"]
  let new = ["a", "x", "c"]
  let script = diff.myers(old, new)
  edit.recover_old(script) == old && edit.recover_new(script) == new
}

// round_trip_holds() -> True
```
## Longest common subsequence
```gleam
import textmetrics/lcs

pub fn lcs_example() -> #(Int, List(String)) {
  let a = ["A", "B", "C", "B", "D", "A", "B"]
  let b = ["B", "D", "C", "A", "B", "A"]
  #(lcs.length(a, b), lcs.sequence(a, b))
}

// lcs_example() -> #(4, ["B", "C", "B", "A"]) // one valid answer
```
`lcs.length` is well-defined; `lcs.sequence` returns one valid longest
common subsequence. Consumers should rely on the invariant
`list.length(lcs.sequence(a, b)) == lcs.length(a, b)` rather than on the
specific subsequence chosen.
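That invariant can be written down directly as a predicate, which makes a
natural property-based test (a sketch; `lcs_invariant` is a hypothetical
helper name, not part of the library):

```gleam
import gleam/list
import textmetrics/lcs

// Should hold for any pair of inputs: the recovered subsequence always
// has exactly the length that lcs.length reports.
pub fn lcs_invariant(a: List(String), b: List(String)) -> Bool {
  list.length(lcs.sequence(a, b)) == lcs.length(a, b)
}
```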
## Unicode policy
All string-typed functions operate on extended grapheme clusters via
`gleam/string.to_graphemes`. The library does not normalize input: NFC
and NFD strings that render identically are reported as different.
Callers wanting NFC equivalence should normalize before invoking.
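As an illustration of the policy above (a sketch; `\u{00E9}` is the
precomposed é, `"e\u{0301}"` is e followed by a combining acute accent):

```gleam
import textmetrics/distance

pub fn nfc_vs_nfd() -> Int {
  // Both strings render as "é" and each is a single grapheme cluster,
  // but the clusters are not code-point-identical, so without prior
  // normalization they count as one substitution.
  distance.levenshtein("\u{00E9}", "e\u{0301}")
}
// nfc_vs_nfd() -> 1
```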
## Development

```sh
mise install
just deps
just ci
```
`just` recipes source `scripts/lib/mise_bootstrap.sh`, so `mise activate`
is not required in the current shell.