Comparison
Compared to many other libraries in this space, the main focus of this one is measuring the real width of text when printed to terminal emulators. This library does not try to guess what font shaping (especially ligatures) would do, and cares less about how things look in other monospace contexts, such as a modern rendering pipeline (e.g. in a browser) with a monospace font. For those cases, I encourage you to use your font rendering library directly, or find a way to lay things out without needing to know the size of your text.
Unicode version
The data files are taken from the Unicode 16 release (Sep 2024).
Algorithm
Components (codepoints, control sequences and grapheme clusters) are classified into four categories:
- zero-width: these do not advance the current column
- narrow: these advance the cursor by one column
- wide: these advance the cursor by two columns
- ambiguous: these advance the cursor by either one or two columns, depending on the ambiguous_as_wide option.
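As a minimal sketch of how the four categories above translate into column advances (written in Python for illustration; the names Category and columns are hypothetical and not part of this library's API):

```python
from enum import Enum


class Category(Enum):
    ZERO_WIDTH = "zero-width"
    NARROW = "narrow"
    WIDE = "wide"
    AMBIGUOUS = "ambiguous"


def columns(category: Category, *, ambiguous_as_wide: bool) -> int:
    """Return how many columns a component of the given category advances."""
    if category is Category.ZERO_WIDTH:
        return 0
    if category is Category.NARROW:
        return 1
    if category is Category.WIDE:
        return 2
    # Only the ambiguous category depends on the option.
    return 2 if ambiguous_as_wide else 1
```

The option is a required keyword argument here because the text above does not say what its default is.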
Control characters and ANSI escape sequences are ignored and reported as zero-width, except:
- Newlines (\n and \r\n) increase the current row and reset the column.
- Tabs (\t) increase the current column to the next tab stop.
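The newline and tab handling described above could be sketched roughly like this in Python. The tab stop interval of 8 is an assumption (the text does not state it), the function names are hypothetical, every printable character is counted as one column as a placeholder, and ANSI escape sequences are not parsed at all:

```python
import unicodedata

TAB_WIDTH = 8  # assumption: a tab stop every 8 columns


def advance(row: int, column: int, char: str, tab_width: int = TAB_WIDTH) -> tuple[int, int]:
    """Return the cursor position after a single character."""
    if char == "\n":
        # Newline: next row, column reset.
        return row + 1, 0
    if char == "\t":
        # Tab: jump to the next multiple of tab_width.
        return row, (column // tab_width + 1) * tab_width
    if unicodedata.category(char) == "Cc":
        # Any other control character is zero-width.
        return row, column
    # Placeholder: everything else counts as one column in this sketch.
    return row, column + 1


def final_position(text: str) -> tuple[int, int]:
    row, column = 0, 0
    # Fold \r\n into \n so the pair counts as a single newline.
    for char in text.replace("\r\n", "\n"):
        row, column = advance(row, column, char)
    return row, column
```

For example, final_position("ab\tc") ends at row 0, column 9: the tab moves the cursor from column 2 to column 8, and "c" advances it once more.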
For the remaining string, three different modes are supported: wcwidth, mode_2027 and mode2027_ext.
Default mode (wcwidth)
By default, this library tries to match the output of the wcwidth function from glibc. This function is still used by the majority of terminal emulators.
In this mode, the width of a string is the sum of the widths of its individual code points. (Extended) grapheme cluster boundaries are ignored.
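To illustrate what summing per-codepoint widths means in practice, here is a rough Python sketch. The helper rough_codepoint_width is a hypothetical stand-in built on the standard unicodedata module; it is not this library's data and will not match glibc in every case:

```python
import unicodedata


def rough_codepoint_width(char: str) -> int:
    """Very rough per-codepoint width; a stand-in for real width tables."""
    if unicodedata.category(char) in ("Mn", "Me", "Cc", "Cf"):
        return 0
    if unicodedata.east_asian_width(char) in ("W", "F"):
        return 2
    return 1


def wcwidth_style_width(text: str) -> int:
    """Sum the widths of all code points, ignoring grapheme cluster boundaries."""
    return sum(rough_codepoint_width(char) for char in text)


# Man + ZWJ + woman + ZWJ + girl: one emoji family as a user sees it.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(wcwidth_style_width(family))     # 6: three wide emoji, two zero-width joiners
print(wcwidth_style_width("e\u0301"))  # 1: the combining accent adds nothing
```

The emoji family measures 6 columns here because each of its three emoji is counted separately, which is exactly the behaviour that cluster-based modes change.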
(Extended) Mode 2027
Mode 2027 is a proposed mode for terminal emulators that applications can request. When active and supported, the terminal emulator is supposed to follow the terminal Unicode core spec. This mode is supported by some terminals, and is even the default and only behaviour in some others.
None of the terminals I tested implement the proposed spec fully, and the exact behaviour is subject to ongoing discussion in the freedesktop.org terminal working group. Support for this mode should therefore be considered best-effort at the moment, especially when support for Brahmic, Arabic, and some East Asian scripts is required. If you just want your emoji family to be 2 columns wide, this mode works well enough right now.
This library does the following:
- Extended grapheme clusters are segmented based on the pop_grapheme function. This function uses native target code, and is therefore not guaranteed to match the Unicode version of this library (at the time of writing, Erlang uses 15.0, and Node.js uses 15.1).
- The width of a grapheme cluster is the maximum width of all codepoints that make it up.
- If a grapheme cluster contains U+FE0F Variation Selector-16 in addition to another non-zero-width codepoint, its width is wide.
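The two width rules above can be sketched in Python as follows, assuming the input has already been split into a single grapheme cluster. The function name cluster_width and the SAMPLE_WIDTHS table are hypothetical; the table holds hand-picked widths for the demo only, where a real implementation would look them up in Unicode data:

```python
VS16 = "\uFE0F"  # U+FE0F Variation Selector-16

# Hand-picked sample widths for this demo only.
SAMPLE_WIDTHS = {
    "\u2764": 1,      # heavy black heart, narrow on its own
    VS16: 0,
    "\U0001F468": 2,  # man
    "\U0001F469": 2,  # woman
    "\U0001F467": 2,  # girl
    "\u200D": 0,      # zero-width joiner
}


def cluster_width(cluster: str, widths: dict = SAMPLE_WIDTHS) -> int:
    """Width of one (already segmented) grapheme cluster."""
    per_codepoint = [widths[char] for char in cluster]
    # VS16 together with any non-zero-width codepoint makes the cluster wide.
    if VS16 in cluster and any(width > 0 for width in per_codepoint):
        return 2
    # Otherwise the cluster is as wide as its widest codepoint.
    return max(per_codepoint, default=0)
```

With these sample widths, the heart alone measures 1 column, the heart followed by U+FE0F measures 2, and the whole man-woman-girl family joined by zero-width joiners measures 2 rather than 6.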
Additionally, in the extended mode, the following rules apply:
- If a grapheme cluster contains two or more non-spacing non-zero-width codepoints, its width is wide. “Spacing” codepoints in this context are all codepoints with a General_Category of Spacing_Mark.
This rule makes the width reported by this library more stable across Erlang and Node.js (because of the Unicode version differences mentioned above), and brings it closer to the behaviour of actual terminals with mode 2027 support. Starting with Unicode 15.1, some sequences that occupy multiple columns are segmented into single grapheme clusters.
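A small Python sketch of the extended rule, using a Devanagari conjunct as the kind of sequence that newer Unicode versions keep in one cluster. The function name and the SAMPLE_WIDTHS table are hypothetical and hand-picked for this demo, the cluster is assumed to be pre-segmented, and the Variation Selector-16 rule is left out for brevity:

```python
import unicodedata

# Hand-picked sample widths for this demo only.
SAMPLE_WIDTHS = {
    "\u0915": 1,  # DEVANAGARI LETTER KA
    "\u0937": 1,  # DEVANAGARI LETTER SSA
    "\u094D": 0,  # DEVANAGARI SIGN VIRAMA (Nonspacing_Mark)
    "\u093E": 1,  # DEVANAGARI VOWEL SIGN AA (Spacing_Mark)
}


def extended_cluster_width(cluster: str, widths: dict = SAMPLE_WIDTHS) -> int:
    """Cluster width with the extra rule of the extended mode applied."""
    # Count codepoints that are neither zero-width nor Spacing_Mark ("Mc").
    counted = sum(
        1
        for char in cluster
        if widths[char] > 0 and unicodedata.category(char) != "Mc"
    )
    if counted >= 2:
        return 2
    return max((widths[char] for char in cluster), default=0)
```

Under these assumptions KA + VIRAMA + SSA measures 2 columns because it contains two counted letters, while KA + VOWEL SIGN AA stays at 1 because the vowel sign is a Spacing_Mark and is not counted.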
Width of a single codepoint
The following codepoints are classified as zero-width:
- All codepoints with the Default_Ignorable_Code_Point property, except for U+115F Hangul Choseong Filler.
- All codepoints with a General_Category of Control, Enclosing_Mark, Nonspacing_Mark, Paragraph_Separator, or Line_Separator.
- All codepoints with a Hangul_Syllable_Type of Vowel or Trailing Jamo.
- All codepoints marked as Prepended_Concatenation_Mark.
- U+FFF9..U+FFFB, interlinear annotation format characters.
- U+13430..U+13440, Egyptian hieroglyph format characters.
The following codepoints are classified as wide:
- All codepoints with the Emoji_Presentation property set.
- All codepoints with an East_Asian_Width of Wide or Fullwidth.
The following codepoints are classified as ambiguous:
- All codepoints with an East_Asian_Width of Ambiguous.
All other codepoints are narrow.
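A partial Python sketch of this classification, limited to what the standard unicodedata module exposes. It deliberately omits the Default_Ignorable_Code_Point, Hangul_Syllable_Type, Prepended_Concatenation_Mark and Emoji_Presentation checks, which that module does not provide, so it is an approximation of the rules above rather than a faithful implementation. The name classify is hypothetical, and results depend on the Unicode version bundled with your Python:

```python
import unicodedata

# Control, Enclosing_Mark, Nonspacing_Mark, Paragraph_Separator, Line_Separator
ZERO_WIDTH_CATEGORIES = {"Cc", "Me", "Mn", "Zp", "Zl"}

# Interlinear annotation and Egyptian hieroglyph format characters
ZERO_WIDTH_RANGES = [(0xFFF9, 0xFFFB), (0x13430, 0x13440)]


def classify(char: str) -> str:
    """Classify a single codepoint as zero-width, wide, ambiguous or narrow."""
    codepoint = ord(char)
    if unicodedata.category(char) in ZERO_WIDTH_CATEGORIES:
        return "zero-width"
    if any(low <= codepoint <= high for low, high in ZERO_WIDTH_RANGES):
        return "zero-width"
    east_asian_width = unicodedata.east_asian_width(char)
    if east_asian_width in ("W", "F"):
        return "wide"
    if east_asian_width == "A":
        return "ambiguous"
    return "narrow"
```

The zero-width checks run first on purpose: combining marks such as U+0301 have an East_Asian_Width of Ambiguous, and would otherwise be reported as ambiguous instead of zero-width.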
Testing
I currently test against VTE (mainly gnome-terminal), Windows Terminal, Kitty, and foot for mode 2027 support. I also test on Contour since they originally proposed the mode 2027 spec; however, they use a custom Unicode library that I don’t fully trust.
When reporting a mismatch, please include the terminal you used and the escaped codepoints. GitLab and Discord sometimes strip certain modifiers.