ucwidth v0.2.0 Ucwidth

A module to determine the display width of a Unicode character (or codepoint) on monospaced screens.

A quick comparison of full-width and half-width characters:

"丐" # 1 full-width grapheme
"gg" # 2 half-width graphemes

This module was originally ported from Dr Markus Kuhn's ucwidth library in C, but uses an updated Unicode database (currently v13.0.0).

Emoji characters are also supported, e.g.:

iex> Ucwidth.width("🍭")
2

Functions provided by this module are grouped into:

width/2 for determining the display width
wide?/1, ambiguous?/1, wide_or_ambiguous?/1, combining?/1 for determining the properties of a grapheme
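
As a rough illustration of how the predicates can be combined, the sketch below classifies a single grapheme. `GraphemeClass` and `classify/1` are hypothetical names that are not part of Ucwidth, the order of the checks is an assumption, and the `ucwidth` package is assumed to be available as a dependency.

```elixir
defmodule GraphemeClass do
  # Hypothetical helper, not part of Ucwidth.
  # Accepts a codepoint (integer) or a single-grapheme string,
  # like the predicates it calls.
  def classify(codepoint_or_grapheme) do
    cond do
      Ucwidth.combining?(codepoint_or_grapheme) -> :combining
      Ucwidth.wide?(codepoint_or_grapheme) -> :wide
      Ucwidth.ambiguous?(codepoint_or_grapheme) -> :ambiguous
      true -> :narrow
    end
  end
end

# The inputs below are taken from the examples in this documentation.
GraphemeClass.classify("\u061c") # combining?/1 is true for this input
GraphemeClass.classify(0x273d)   # ambiguous?/1 is true for this input
GraphemeClass.classify("-")      # no predicate matches, so :narrow
```

Checking combining?/1 first is a design choice in this sketch: it keeps zero-width characters from being reported as narrow.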

Ambiguous width

According to the Unicode specification of East Asian Width, some characters have a variable width that depends on the context. The left single quotation mark "‘" (\u{2018}), for example, may take one or two cells depending on whether or not it appears in an East Asian context.

See https://www.unicode.org/reports/tr11/#ED6 for more information.

This module provides an option to specify how ambiguous characters are treated. See width/2 for more information.
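
A minimal sketch of choosing that option at the call site follows. `AmbiguousWidthExample`, `width_for/2`, and the `east_asian_context?` flag are hypothetical names introduced for illustration; only `Ucwidth.width/2` and its `:narrow` / `:wide` values come from this module, and the `ucwidth` package is assumed to be available as a dependency.

```elixir
defmodule AmbiguousWidthExample do
  # Hypothetical helper, not part of Ucwidth: selects the `ambiguous_as`
  # option from a caller-supplied flag.
  def width_for(string, east_asian_context?) when is_boolean(east_asian_context?) do
    ambiguous_as = if east_asian_context?, do: :wide, else: :narrow
    Ucwidth.width(string, ambiguous_as)
  end
end

# The left single quotation mark is ambiguous, so the flag changes its width.
AmbiguousWidthExample.width_for("‘", false) # counted as narrow (1 cell)
AmbiguousWidthExample.width_for("‘", true)  # counted as wide (2 cells)
```

How the flag is derived (locale, terminal configuration, user setting) is left to the caller; this module only needs to be told which treatment to apply.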

Combined Emoji characters

Following the latest Unicode specifications, the width of a combined Emoji grapheme is counted as if it were a single emoji, which is 2 cells. Please note that not all terminals support the latest version of the Unicode specification, so such combined Emoji characters may be displayed differently.

For example, the "woman scientist" emoji's width is 2:

iex> Ucwidth.width("👩‍🔬")
2

But some terminals may display it as two separate emoji: 👩🔬

This problem is implementation-related, and this library sticks to the canonical Unicode specifications.

Summary

Functions

ambiguous?(codepoint_or_grapheme)

Check if a grapheme is ambiguous in Unicode.

combining?(codepoint_or_grapheme)

Check if a Unicode grapheme is a combining character.

wide?(codepoint_or_grapheme)

Check if a grapheme is wide in Unicode.

wide_or_ambiguous?(codepoint_or_grapheme)

Check if a grapheme is wide or ambiguous in Unicode.

width(codepoint_or_graphemes, ambiguous_as \\ :narrow)

Get width of a codepoint or grapheme.

Functions

ambiguous?(codepoint_or_grapheme)

ambiguous?(non_neg_integer() | String.t()) :: boolean()

Check if a grapheme is ambiguous in Unicode.

The dataset is generated with uniset: uniset eaw:A

The display width of an ambiguous grapheme is determined by the context provided. It may take two cells in an East Asian context, and one cell otherwise.

iex> Ucwidth.ambiguous?(0x273d)
true

iex> Ucwidth.ambiguous?("在")
false

combining?(codepoint_or_grapheme)

combining?(non_neg_integer() | String.t()) :: boolean()

Check if a Unicode grapheme is a combining character.

The dataset is generated with uniset: uniset cat:Me,Mn,Cf + U+00AD + U+1160..U+11FF + U+200B + U+000C

For example:

iex> Ucwidth.combining?("\u061c")
true

iex> Ucwidth.combining?("-")
false

wide?(codepoint_or_grapheme)

wide?(non_neg_integer() | String.t()) :: boolean()

Check if a grapheme is wide in Unicode.

The dataset is generated with uniset: uniset eaw:W,F

A grapheme is considered wide only if it:

is East Asian Wide (W), or
is East Asian Fullwidth (F)

wide_or_ambiguous?(codepoint_or_grapheme)

wide_or_ambiguous?(non_neg_integer() | String.t()) :: boolean()

Check if a grapheme is wide or ambiguous in Unicode.

The dataset is generated with uniset: uniset eaw:W,F,A

See wide?/1 for the definition of wide and ambiguous?/1 for the definition of ambiguous.

width(codepoint_or_graphemes, ambiguous_as \\ :narrow)

width(non_neg_integer() | String.t(), :wide | :narrow) ::
  0 | 1 | 2 | {:error, :bad_arg}

Get width of a codepoint or grapheme.

Parameters

codepoint_or_graphemes - a string or unicode codepoint
- an integer within the valid Unicode codepoint range (0..0x10ffff)
- a string, e.g. "c", "\u{3f0a1}", "hey"
ambiguous_as - the treatment of ambiguous characters, by default :narrow
- :narrow - treated as if they are narrow
- :wide - treated as if they are wide
  
  For example:
```
  iex> Ucwidth.width("\u00a1", :narrow)
  1

  iex> Ucwidth.width("\u00a1", :wide)
  2
```

Return values

Returns the width of the grapheme/codepoint:

0 means this grapheme is invisible and takes no space on screen.
1 means it takes one cell to display. For instance, English letters are one cell wide.
2 means it takes two cells to display. This is common for East Asian characters.

Examples

iex> Ucwidth.width(0)
0

iex> Ucwidth.width("5")
1

iex> Ucwidth.width("〿")
1

iex> Ucwidth.width("〈")
2

iex> Ucwidth.width("⺀")
2

iex> Ucwidth.width(255)
1

If the string contains more than one grapheme, the sum of its graphemes' widths is returned.

iex> Ucwidth.width("abc")
3

iex> Ucwidth.width("仓仓")
4
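
Because the width of a whole string is the sum of its graphemes' widths, one typical use is aligning text in fixed-width columns. The following is a minimal sketch under that assumption. `DisplayPad` and `pad_trailing/2` are hypothetical names that are not part of Ucwidth, and the `ucwidth` package is assumed to be available as a dependency.

```elixir
defmodule DisplayPad do
  # Hypothetical helper, not part of Ucwidth: appends spaces until `string`
  # occupies `cells` terminal cells.
  def pad_trailing(string, cells) when is_binary(string) and is_integer(cells) do
    case Ucwidth.width(string) do
      width when is_integer(width) and width < cells ->
        string <> String.duplicate(" ", cells - width)

      # Already wide enough, or width/2 returned an error tuple:
      # leave the string unchanged.
      _ ->
        string
    end
  end
end

# Both cells below occupy six columns, so the closing bars line up.
IO.puts(DisplayPad.pad_trailing("abc", 6) <> "|")
IO.puts(DisplayPad.pad_trailing("仓仓", 6) <> "|")
```

Padding by `String.length/1` instead would count "仓仓" as two columns and misalign the second row on a terminal that renders it four cells wide.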