ucwidth v0.2.0 Ucwidth

A module to determine the width of a Unicode character (or codepoint) on monospaced screens.

A quick comparison of full-width and half-width characters:

"丐" # 1 full-width grapheme
"gg" # 2 half-width graphemes

This module was originally ported from Dr Markus Kuhn's ucwidth library in C, but uses an updated Unicode database (currently v13.0.0).

Furthermore, Emoji characters are supported, e.g.:

iex> Ucwidth.width("🍭")
2

Functions provided by this module are grouped into:

Ambiguous width

According to the Unicode specification of East Asian Width, some characters have variable width, depending on the context. The left single quotation mark "‘" (\u{2018}), for example, may take one or two cells depending on whether it is in an East Asian context or not.

See https://www.unicode.org/reports/tr11/#ED6 for more information.

This module provides an option to specify how ambiguous characters are treated. See width/2 for more information.
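
As an illustration of what "context" can mean in practice, the sketch below guesses a value for the ambiguous_as option from a POSIX-style locale string. The module name AmbiguousWidthExample, the function ambiguous_as/1, and the language list are assumptions made for this example; they are not part of Ucwidth, and the heuristic is only one possible policy.

```elixir
defmodule AmbiguousWidthExample do
  # Hypothetical helper, not part of Ucwidth: guess how ambiguous-width
  # characters should be treated from a POSIX-style locale string.
  # The language list is an assumption chosen for illustration.
  @east_asian_languages ["ja", "ko", "zh"]

  def ambiguous_as(locale) when is_binary(locale) do
    # "ja_JP.UTF-8" -> "ja"
    language =
      locale
      |> String.split(["_", ".", "@"], parts: 2)
      |> hd()
      |> String.downcase()

    if language in @east_asian_languages, do: :wide, else: :narrow
  end

  # Anything that is not a string falls back to the narrow treatment.
  def ambiguous_as(_), do: :narrow
end

mode = AmbiguousWidthExample.ambiguous_as(System.get_env("LANG") || "")
IO.inspect(mode)
```

With Ucwidth installed, the resulting atom could be passed as the second argument of width/2, for example Ucwidth.width("\u{2018}", mode).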

Combined Emoji characters

Following the latest Unicode specifications, the width of a combined Emoji grapheme is counted as that of a single emoji, which is 2 cells. Note that not all terminals support the latest version of the Unicode specification, so these combined Emoji characters may be displayed inconsistently.

For example, the "woman scientist" emoji's width is 2:

iex> Ucwidth.width("👩‍🔬")
2

But in some terminals it may be displayed as 👩🔬

This problem is implementation-related; this library sticks to the canonical Unicode specifications.
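
To see why a terminal may split the sequence, it can help to look at the codepoints behind the grapheme. The sketch below uses only Elixir's standard String module and makes no Ucwidth calls; the module name ZwjExample and its functions are made up for illustration. The "woman scientist" emoji is U+1F469 (woman) joined to U+1F52C (microscope) by U+200D (zero width joiner), and a terminal that ignores the joiner draws the two halves separately.

```elixir
defmodule ZwjExample do
  # Hypothetical helper module, not part of Ucwidth.

  # "woman scientist": WOMAN + ZERO WIDTH JOINER + MICROSCOPE
  def woman_scientist, do: "\u{1F469}\u{200D}\u{1F52C}"

  # The codepoints the sequence is built from, as integers.
  def codepoints(string), do: String.to_charlist(string)

  # Roughly what a terminal that ignores the joiner ends up rendering:
  # the same string with every U+200D removed.
  def without_joiner(string), do: String.replace(string, "\u{200D}", "")
end

ZwjExample.woman_scientist()
|> ZwjExample.codepoints()
|> IO.inspect(base: :hex)

ZwjExample.woman_scientist()
|> ZwjExample.without_joiner()
|> IO.puts()
```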

Summary

Functions

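ambiguous?(codepoint_or_grapheme)

Check if a grapheme is ambiguous in Unicode.

combining?(codepoint_or_grapheme)

Check if a Unicode grapheme is a combining character.

wide?(codepoint_or_grapheme)

Check if a grapheme is wide in Unicode.

wide_or_ambiguous?(codepoint_or_grapheme)

Check if a grapheme is wide or ambiguous in Unicode.

width(codepoint_or_graphemes, ambiguous_as \\ :narrow)

Get width of a codepoint or grapheme.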

Functions

ambiguous?(codepoint_or_grapheme)

ambiguous?(non_neg_integer() | String.t()) :: boolean()

Check if a grapheme is ambiguous in Unicode.

The dataset is generated with uniset: uniset eaw:A

The display width of an ambiguous grapheme is determined by the context provided. It might take two cells in an East Asian context, and one cell otherwise.

iex> Ucwidth.ambiguous?(0x273d)
true

iex> Ucwidth.ambiguous?("在")
false

combining?(codepoint_or_grapheme)

combining?(non_neg_integer() | String.t()) :: boolean()

Check if a Unicode grapheme is a combining character.

The dataset is generated with uniset: uniset cat:Me,Mn,Cf + U+00AD + U+1160..U+11FF + U+200B + U+000C

For example:

iex> Ucwidth.combining?("\u061c")
true

iex> Ucwidth.combining?("-")
false
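
For background on why combining characters matter for width, the following sketch uses only Elixir's standard String module (no Ucwidth calls). The module name CombiningExample and its functions are hypothetical. It shows that "e" followed by U+0301 COMBINING ACUTE ACCENT (general category Mn) is two codepoints but a single grapheme, because the accent is drawn on top of the base letter rather than in its own cell.

```elixir
defmodule CombiningExample do
  # Hypothetical helper module, not part of Ucwidth.

  # "e" followed by U+0301 COMBINING ACUTE ACCENT (category Mn).
  def decomposed_e_acute, do: "e\u0301"

  # Count the same string in graphemes and in codepoints.
  def counts(string) do
    %{
      graphemes: String.length(string),
      codepoints: string |> String.codepoints() |> length()
    }
  end
end

CombiningExample.decomposed_e_acute()
|> CombiningExample.counts()
|> IO.inspect()
```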

wide?(codepoint_or_grapheme)

wide?(non_neg_integer() | String.t()) :: boolean()

Check if a grapheme is wide in Unicode.

The dataset is generated with uniset: uniset eaw:W,F

A grapheme is considered wide only if it:

  • is East Asian Wide (W), or
  • is East Asian Fullwidth (F)
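
How Ucwidth stores and queries its generated dataset is not described here. As a rough sketch of the general idea, a range-table lookup could look like the following. The module RangeTableExample and its three hand-picked ranges are assumptions for illustration only; a table generated from the Unicode database contains many more ranges.

```elixir
defmodule RangeTableExample do
  # Hypothetical module, not Ucwidth's implementation.
  # Tiny, hand-picked excerpt of East Asian Width data for illustration.
  @wide_ranges [
    # CJK Unified Ideographs block (East Asian Wide)
    0x4E00..0x9FFF,
    # Hangul Syllables (East Asian Wide)
    0xAC00..0xD7A3,
    # Fullwidth forms of ASCII characters and brackets (East Asian Fullwidth)
    0xFF01..0xFF60
  ]

  # Integer codepoint: check whether any range in the table contains it.
  def wide?(codepoint) when is_integer(codepoint) do
    Enum.any?(@wide_ranges, fn range -> codepoint in range end)
  end

  # String holding exactly one codepoint: decode it and reuse the clause above.
  def wide?(<<codepoint::utf8>>), do: wide?(codepoint)

  # Anything else is treated as not wide in this sketch.
  def wide?(_), do: false
end

IO.inspect(RangeTableExample.wide?("\u{4ED3}"))
IO.inspect(RangeTableExample.wide?("a"))
```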

wide_or_ambiguous?(codepoint_or_grapheme)

wide_or_ambiguous?(non_neg_integer() | String.t()) :: boolean()

Check if a grapheme is wide or ambiguous in Unicode.

The dataset is generated with uniset: uniset eaw:W,F,A

See wide?/1 for the definition of wide. See ambiguous?/1 for the definition of ambiguous.

width(codepoint_or_graphemes, ambiguous_as \\ :narrow)

width(non_neg_integer() | String.t(), :wide | :narrow) ::
  0 | 1 | 2 | {:error, :bad_arg}

Get width of a codepoint or grapheme.

Parameters

  • codepoint_or_graphemes - a string or Unicode codepoint

    • an integer within the valid Unicode code point range (0..0x10ffff)
    • a string, e.g. "c", "\u{3f0a1}", "hey"
  • ambiguous_as - the treatment of ambiguous characters, by default :narrow

    • :narrow - treated as if they are narrow
    • :wide - treated as if they are wide

      For example:

      iex> Ucwidth.width("\u00a1", :narrow)
      1
    
      iex> Ucwidth.width("\u00a1", :wide)
      2

Return values

Returns the width of the grapheme/codepoint:

  • 0 means this grapheme is invisible and takes no space on screen.
  • 1 means it takes one cell to display. For instance, English letters are one cell wide.
  • 2 means it takes two cells to display. This is quite common in East Asian charsets.

Examples

iex> Ucwidth.width(0)
0

iex> Ucwidth.width("5")
1

iex> Ucwidth.width("〿")
1

iex> Ucwidth.width("〈")
2

iex> Ucwidth.width("⺀")
2

iex> Ucwidth.width(255)
1

If the string contains more than one grapheme, the sum of the widths of its graphemes is returned.

iex> Ucwidth.width("abc")
3

iex> Ucwidth.width("仓仓")
4
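
A typical use of the summed width is aligning text in terminal columns. The sketch below pads a string to a given number of cells. The module ColumnExample, the function pad_trailing/3, and the width_fun parameter are hypothetical names introduced for this example; width_fun defaults to Ucwidth.width/1 and can be replaced so the sketch runs without the library.

```elixir
defmodule ColumnExample do
  # Hypothetical helper, not part of Ucwidth. Pads `text` with spaces until
  # it occupies `columns` terminal cells. `width_fun` is an assumption of this
  # sketch: it defaults to Ucwidth.width/1 and can be swapped out.
  def pad_trailing(text, columns, width_fun \\ &Ucwidth.width/1) do
    case width_fun.(text) do
      width when is_integer(width) ->
        # Never pad by a negative amount when the text is already too wide.
        text <> String.duplicate(" ", max(columns - width, 0))

      {:error, _reason} ->
        # The width could not be determined; return the text unchanged.
        text
    end
  end
end

# Runs without Ucwidth by counting graphemes instead of cells.
IO.inspect(ColumnExample.pad_trailing("ab", 4, &String.length/1))
```

With Ucwidth available, ColumnExample.pad_trailing("仓仓", 6) would append two spaces, given the width of 4 reported above. String.pad_trailing/2 counts graphemes rather than cells and would append four.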