Unicode (Unicode v1.20.0)

Functions to introspect the Unicode character database and to provide fast codepoint lookups for scripts, blocks, categories and properties.

Summary

Types

A codepoint is an integer representing a Unicode character

A codepoint or a string

Unicode UTF encodings

The valid scripts as of Unicode 15

Functions

Returns a list of tuples representing the full range of Unicode code points.

Returns true if a single Unicode codepoint (or all characters in the given string) adhere to the Derived Core Property Alphabetic otherwise returns false.

Returns true if a single Unicode codepoint (or all characters in the given string) are either alphabetic?/1 or numeric?/1 otherwise returns false.

Returns a list of tuples representing the assigned ranges of Unicode code points.

Returns the block name of a codepoint or the list of block names for each codepoint in a string.

Returns true if the codepoint has the :cased property, otherwise returns false.

Returns the Unicode category for a codepoint or a list of categories for a string.

Returns true if a single Unicode codepoint (or all characters in the given string) adhere to Unicode category :Nd otherwise returns false.

Returns true if a single Unicode codepoint (or all characters in the given string) are emoji otherwise returns false.

Returns true if a single Unicode codepoint (or all characters in the given string) adhere to Unicode category :Ll, otherwise returns false.

Returns true if a single Unicode codepoint (or all characters in the given string) adhere to Unicode category :Sm, otherwise returns false.

Returns true if a single Unicode codepoint (or all characters in the given string) adhere to one of the Unicode categories :Nd, :Nl or :No, otherwise returns false.

Returns the list of properties for a given codepoint, or the list of properties of each codepoint in a given string.

Returns a map of aliases mapping property names to a module that serves that property.

ranges() deprecated

Ensures that a binary is valid UTF-encoded data.

Returns the script name of a codepoint or the list of script names for each codepoint in a string.

Returns a keyword list of scripts in descending dominance order for a given string.

Returns the first index and grapheme count of each script detected in a string.

Removes accents (diacritical marks) from a string.

Returns true if a single Unicode codepoint (or all characters in the given string) adhere to Unicode category :Lu, otherwise returns false.

Returns the version of Unicode in use.

Types

@type codepoint() :: non_neg_integer()

A codepoint is an integer representing a Unicode character

@type codepoint_or_string() :: codepoint() | String.t()

A codepoint or a string

@type encoding() ::
  :utf8 | :utf16 | :utf16be | :utf16le | :utf32 | :utf32be | :utf32le

Unicode UTF encodings

@type script() ::
  :tangsa
  | :runic
  | :greek
  | :myanmar
  | :cherokee
  | :palmyrene
  | :elymaic
  | :latin

The valid scripts as of Unicode 15

Functions

all()

@spec all() :: [{0, 1_114_111}]

Returns a list of tuples representing the full range of Unicode code points.

alphabetic?(codepoint_or_string)

@spec alphabetic?(codepoint_or_string()) :: boolean()

Returns true if a single Unicode codepoint (or all characters in the given string) adhere to the Derived Core Property Alphabetic otherwise returns false.

These are all characters that are usually used to represent letters or syllables in words and sentences.

Arguments

  • codepoint_or_string is a single integer codepoint or a String.t/0.

Returns

  • true or false

For the string-version, the result will be true only if all codepoints in the string adhere to the property.

Examples

iex> Unicode.alphabetic?(?a)
true

iex> Unicode.alphabetic?("A")
true

iex> Unicode.alphabetic?("Elixir")
true

iex> Unicode.alphabetic?("الإكسير")
true

# comma and whitespace
iex> Unicode.alphabetic?("foo, bar")
false

iex> Unicode.alphabetic?("42")
false

iex> Unicode.alphabetic?("龍王")
true

# Summation, ∑
iex> Unicode.alphabetic?("∑")
false

# Greek capital letter sigma, Σ
iex> Unicode.alphabetic?("Σ")
true
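
The string rule shared by the predicate functions on this page (true only if every codepoint qualifies) can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: `AllCodepointsSketch`, `all_codepoints?/2` and the ASCII-only `ascii_letter?` predicate are hypothetical names standing in for a real property lookup.

```elixir
defmodule AllCodepointsSketch do
  # Hypothetical helper: applies a per-codepoint predicate either to a single
  # codepoint or to every codepoint of a string.
  def all_codepoints?(codepoint, predicate) when is_integer(codepoint) do
    predicate.(codepoint)
  end

  def all_codepoints?(string, predicate) when is_binary(string) do
    string
    |> String.to_charlist()
    |> Enum.all?(predicate)
  end
end

# Stand-in predicate covering ASCII letters only (an assumption for the demo).
ascii_letter? = fn cp -> cp in ?a..?z or cp in ?A..?Z end

IO.inspect(AllCodepointsSketch.all_codepoints?(?a, ascii_letter?))
IO.inspect(AllCodepointsSketch.all_codepoints?("Elixir", ascii_letter?))
# The comma and the space fail the predicate, so the whole string fails.
IO.inspect(AllCodepointsSketch.all_codepoints?("foo, bar", ascii_letter?))
```

A single non-matching codepoint, such as punctuation or whitespace, is enough to make the string form return false.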

alphanumeric?(codepoint_or_string)

@spec alphanumeric?(codepoint_or_string()) :: boolean()

Returns true if a single Unicode codepoint (or all characters in the given string) are either alphabetic?/1 or numeric?/1 otherwise returns false.

Arguments

  • codepoint_or_string is a single integer codepoint or a String.t/0.

Returns

  • true or false

For the string-version, the result will be true only if all codepoints in the string adhere to the property.

Examples

iex> Unicode.alphanumeric? "1234"
true

iex> Unicode.alphanumeric? "KeyserSöze1995"
true

iex> Unicode.alphanumeric? "3段"
true

iex> Unicode.alphanumeric? "dragon@example.com"
false

assigned()

@spec assigned() :: [{pos_integer(), pos_integer()}]

Returns a list of tuples representing the assigned ranges of Unicode code points.

This information is derived from the block ranges as defined by Unicode.Block.blocks/0.

block(codepoint_or_string)

@spec block(codepoint_or_string()) :: atom() | [atom(), ...]

Returns the block name of a codepoint or the list of block names for each codepoint in a string.

Arguments

  • codepoint_or_string is a single integer codepoint or a String.t/0.

Returns

  • in the case of a single codepoint, an atom block name

  • in the case of a string, a list of atom block names for each codepoint in the codepoint_or_string

Examples

iex> Unicode.block ?ä
:latin_1_supplement

iex> Unicode.block ?A
:basic_latin

iex> Unicode.block "äA"
[:latin_1_supplement, :basic_latin]

cased?(codepoint_or_string)

@spec cased?(codepoint_or_string()) :: boolean()

Returns true if the codepoint has the :cased property, otherwise returns false.

The :cased property means that the character has at least an uppercase and a lowercase representation, and possibly a titlecase representation too.

Arguments

  • codepoint_or_string is a single integer codepoint or a String.t/0.

Returns

  • true or false

For the string-version, the result will be true only if all codepoints in the string adhere to the property.

Examples

iex> Unicode.cased? ?ယ
false

iex> Unicode.cased? ?A
true

category(codepoint_or_string)

@spec category(codepoint_or_string()) :: atom() | [atom(), ...]

Returns the Unicode category for a codepoint or a list of categories for a string.

Arguments

  • codepoint_or_string is a single integer codepoint or a String.t/0.

Returns

  • in the case of a single codepoint, an atom representing one of the categories listed below

  • in the case of a string, a list representing the category for each codepoint in the string

Notes

These categories match the names of the Unicode character classes used in various regular expression engines and in Unicode Sets. The full list of categories is:

Category  Matches
:C        Other
:Cc       Control
:Cf       Format
:Cn       Unassigned
:Co       Private use
:Cs       Surrogate
:L        Letter
:Ll       Lower case letter
:Lm       Modifier letter
:Lo       Other letter
:Lt       Title case letter
:Lu       Upper case letter
:M        Mark
:Mc       Spacing mark
:Me       Enclosing mark
:Mn       Non-spacing mark
:N        Number
:Nd       Decimal number
:Nl       Letter number
:No       Other number
:P        Punctuation
:Pc       Connector punctuation
:Pd       Dash punctuation
:Pe       Close punctuation
:Pf       Final punctuation
:Pi       Initial punctuation
:Po       Other punctuation
:Ps       Open punctuation
:S        Symbol
:Sc       Currency symbol
:Sk       Modifier symbol
:Sm       Mathematical symbol
:So       Other symbol
:Z        Separator
:Zl       Line separator
:Zp       Paragraph separator
:Zs       Space separator

Note too that the group level categories like :L, :M, :S and so on are not assigned to any codepoint. They can only be identified by combining the results for each of the subsidiary categories.

Examples

iex> Unicode.category ?ä
:Ll

iex> Unicode.category ?A
:Lu

iex> Unicode.category ?🧐
:So

iex> Unicode.category ?+
:Sm

iex> Unicode.category ?1
:Nd

iex> Unicode.category "äA"
[:Ll, :Lu]
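
Because the group-level categories are never returned for a codepoint, a caller who wants to test for "any letter" or "any symbol" has to map the subsidiary category back to its group. One way to do that is sketched below. The module and function names (`CategoryGroupSketch`, `group_of/1`, `in_group?/2`) are hypothetical and not part of this library; the mapping itself is copied from the category list above.

```elixir
defmodule CategoryGroupSketch do
  # Group-level categories and their subsidiary categories, taken from the
  # category list in the notes above.
  @groups %{
    C: [:Cc, :Cf, :Cn, :Co, :Cs],
    L: [:Ll, :Lm, :Lo, :Lt, :Lu],
    M: [:Mc, :Me, :Mn],
    N: [:Nd, :Nl, :No],
    P: [:Pc, :Pd, :Pe, :Pf, :Pi, :Po, :Ps],
    S: [:Sc, :Sk, :Sm, :So],
    Z: [:Zl, :Zp, :Zs]
  }

  # Returns the group for a subsidiary category, or nil if it is unknown.
  def group_of(category) when is_atom(category) do
    Enum.find_value(@groups, fn {group, members} ->
      if category in members, do: group
    end)
  end

  # True if the subsidiary category belongs to the given group.
  def in_group?(category, group) when is_atom(category) and is_atom(group) do
    category in Map.get(@groups, group, [])
  end
end

IO.inspect(CategoryGroupSketch.group_of(:Lu))
IO.inspect(CategoryGroupSketch.in_group?(:Sm, :S))
```

With the library loaded, the result of `Unicode.category/1` for a single codepoint could be passed straight to `group_of/1`.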

digits?(codepoint_or_string)

@spec digits?(codepoint_or_string()) :: boolean()

Returns true if a single Unicode codepoint (or all characters in the given string) adhere to Unicode category :Nd otherwise returns false.

This group of characters represents the decimal digits zero through nine (0..9) and the equivalents in non-Latin scripts.

Arguments

  • codepoint_or_string is a single integer codepoint or a String.t/0.

Returns

  • true or false

For the string-version, the result will be true only if all codepoints in the string adhere to the property.

Examples

downcase?(codepoint_or_string)

See Unicode.Property.lowercase?/1.

emoji?(codepoint_or_string)

@spec emoji?(codepoint_or_string()) :: boolean()

Returns true if a single Unicode codepoint (or all characters in the given string) are emoji otherwise returns false.

Arguments

  • codepoint_or_string is a single integer codepoint or a String.t/0.

Returns

  • true or false

For the string-version, the result will be true only if all codepoints in the string adhere to the property.

Examples

iex> Unicode.emoji? "🧐🤓🤩🤩️🤯"
true

lowercase?(codepoint_or_string)

@spec lowercase?(codepoint_or_string()) :: boolean()

Returns true if a single Unicode codepoint (or all characters in the given string) adhere to Unicode category :Ll, otherwise returns false.

Notice that there are many languages that do not have a distinction between cases. Their characters are not included in this group.

Arguments

  • codepoint_or_string is a single integer codepoint or a String.t/0.

Returns

  • true or false

For the string-version, the result will be true only if all codepoints in the string adhere to the property.

Examples

iex> Unicode.lowercase?(?a)
true

iex> Unicode.lowercase?("A")
false

iex> Unicode.lowercase?("Elixir")
false

iex> Unicode.lowercase?("léon")
true

iex> Unicode.lowercase?("foo, bar")
false

iex> Unicode.lowercase?("42")
false

iex> Unicode.lowercase?("Σ")
false

iex> Unicode.lowercase?("σ")
true

math?(codepoint_or_string)

@spec math?(codepoint_or_string()) :: boolean()

Returns true if a single Unicode codepoint (or all characters in the given string) adhere to Unicode category :Sm, otherwise returns false.

These are all characters whose primary usage is in mathematical concepts (and not in alphabets). Notice that the numerical digits are not part of this group.

Arguments

  • codepoint_or_string is a single integer codepoint or a String.t/0.

Returns

  • true or false

For the string-version, the result will be true only if all codepoints in the string adhere to the property.

Examples

iex> Unicode.math?(?=)
true

iex> Unicode.math?("=")
true

iex> Unicode.math?("1+1=2") # Digits do not have the `:math` property.
false

iex> Unicode.math?("परिस")
false

iex> Unicode.math?("∑") # Summation, \u2211
true

iex> Unicode.math?("Σ") # Greek capital letter sigma, \u03a3
false

numeric?(codepoint_or_string)

@spec numeric?(codepoint_or_string()) :: boolean()

Returns true if a single Unicode codepoint (or all characters in the given string) adhere to one of the Unicode categories :Nd, :Nl or :No, otherwise returns false.

This group covers the decimal digits zero through nine (0..9) and their equivalents in non-Latin scripts (:Nd), letter numbers such as Roman numerals (:Nl), and other numbers such as fractions and superscript digits (:No).

Arguments

  • codepoint_or_string is a single integer codepoint or a String.t/0.

Returns

  • true or false

For the string-version, the result will be true only if all codepoints in the string adhere to the property.

Examples

iex> Unicode.numeric?("65535")
true

iex> Unicode.numeric?("42")
true

iex> Unicode.numeric?("lapis philosophorum")
false

properties(codepoint_or_string)

@spec properties(codepoint_or_string()) :: [atom(), ...] | [[atom(), ...], ...]

Returns the list of properties for a given codepoint, or the list of properties of each codepoint in a given string.

Arguments

  • codepoint_or_string is a single integer codepoint or a String.t/0.

Returns

  • in the case of a single codepoint, an atom list of properties.

  • in the case of a string, one atom list for each codepoint in the codepoint_or_string.

Examples

iex> Unicode.properties 0x1bf0
[
  :alphabetic,
  :case_ignorable,
  :grapheme_extend,
  :id_continue,
  :incb,
  :other_alphabetic,
  :xid_continue
]

iex> Unicode.properties ?A
[
  :alphabetic,
  :ascii_hex_digit,
  :cased,
  :changes_when_casefolded,
  :changes_when_casemapped,
  :changes_when_lowercased,
  :grapheme_base,
  :hex_digit,
  :id_continue,
  :id_start,
  :uppercase,
  :xid_continue,
  :xid_start
]

iex> Unicode.properties ?+
[:grapheme_base, :math, :pattern_syntax]

iex> Unicode.properties "a1+"
[
  [
    :alphabetic,
    :ascii_hex_digit,
    :cased,
    :changes_when_casemapped,
    :changes_when_titlecased,
    :changes_when_uppercased,
    :grapheme_base,
    :hex_digit,
    :id_continue,
    :id_start,
    :lowercase,
    :xid_continue,
    :xid_start
  ],
  [
    :ascii_hex_digit,
    :emoji,
    :emoji_component,
    :grapheme_base,
    :hex_digit,
    :id_continue,
    :xid_continue
  ],
  [
    :grapheme_base,
    :math,
    :pattern_syntax
  ]
]

Returns a map of aliases mapping property names to a module that serves that property.

ranges()

This function is deprecated. Use Unicode.assigned/0 instead.

replace_invalid(binary, encoding \\ :utf8, replacement \\ "�")

(since 1.18.0)
@spec replace_invalid(
  binary :: binary(),
  encoding :: encoding(),
  replacement :: String.t()
) :: binary()

Ensures that a binary is valid UTF-encoded data.

The string is validated by replacing any invalid UTF bytes or incomplete sequences with a replacement string.

Arguments

  • binary is any sequence of bytes.

  • encoding is any UTF encoding being one of :utf8, :utf16, :utf16be, :utf16le, :utf32, :utf32be or :utf32le. The default is :utf8.

  • replacement is any string that will be used to replace invalid UTF bytes or incomplete sequences. The default is "�".

Returns

  • A valid UTF binary that may or may not include replacements for invalid UTF. If encoding is :utf8 then the return type is a String.t/0.

Notes

  • Unicode.replace_invalid(string, :utf8) will delegate to String.replace_invalid/2 where available, which is from Elixir 1.16 onwards.
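
For the UTF-8 case, the behaviour described in the note above can be approximated with the standard library only. The sketch below assumes Elixir 1.16 or later, since it calls `String.replace_invalid/2`; `SanitizeSketch` and `sanitize/2` are hypothetical names, and the other encodings handled by this function are not covered.

```elixir
defmodule SanitizeSketch do
  # Hypothetical wrapper for the UTF-8 case only. Requires Elixir 1.16 or
  # later because it calls String.replace_invalid/2.
  def sanitize(binary, replacement \\ "�") when is_binary(binary) do
    if String.valid?(binary) do
      # Already valid UTF-8: return the input untouched.
      binary
    else
      # Replace invalid bytes and incomplete sequences with the replacement.
      String.replace_invalid(binary, replacement)
    end
  end
end

IO.inspect(SanitizeSketch.sanitize(<<"foo", 0b11111111, "bar">>))
IO.inspect(SanitizeSketch.sanitize("already valid"))
```

Checking `String.valid?/1` first is optional; it only makes explicit that valid input passes through unchanged.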

Example

iex> Unicode.replace_invalid(<<"foo", 0b11111111, "bar">>, :utf8)
"foo�bar"

script(codepoint_or_string)

@spec script(codepoint_or_string()) :: String.t() | [String.t(), ...]

Returns the script name of a codepoint or the list of script names for each codepoint in a string.

Arguments

  • codepoint_or_string is a single integer codepoint or a String.t/0.

Returns

  • in the case of a single codepoint, the script name as an atom

  • in the case of a string, a list of atom script names for each codepoint in the codepoint_or_string

Examples

iex> Unicode.script ?ä
:latin

iex> Unicode.script ?ج
:arabic

iex> Unicode.script ?अ
:devanagari

iex> Unicode.script ?א
:hebrew

iex> Unicode.script ?Ж
:cyrillic

iex> Unicode.script ?δ
:greek

iex> Unicode.script ?ก
:thai

iex> Unicode.script ?ယ
:myanmar

script_dominance(string)

(since 1.16.0)
@spec script_dominance(String.t()) :: [{script(), {non_neg_integer(), pos_integer()}}]

Returns a keyword list of scripts in descending dominance order for a given string.

Dominance is determined by (in order of priority):

  • Index of the first occurrence of the script
  • Count of the number of graphemes in the script
  • Lexical ordering of the script name (used as a final means to ensure returning a deterministic result).

Arguments

  • string is any String.t/0.

Returns

  • A keyword list where the key is a script/0 and the value is a tuple where the first element is the index in the string where that script first appeared and the second element is the number of graphemes in that script. The list is ordered by descending dominance.

Example

iex> Unicode.script_dominance "Tokyo is the capital of 日本"
[latin: {0, 19}, common: {5, 5}, han: {24, 2}]

iex> Unicode.script_dominance "おはよう"
[hiragana: {0, 4}]
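
The dominance ordering described above can be sketched as a sort over per-script statistics, using a composite sort key. This is an illustration rather than the library's code: `DominanceSketch` and `order/1` are hypothetical names, and the direction of the grapheme-count tie-break (larger count first) is an assumption.

```elixir
defmodule DominanceSketch do
  # Orders script statistics by: earliest first index, then larger grapheme
  # count (an assumed tie-break direction), then script name.
  def order(statistics) when is_map(statistics) do
    Enum.sort_by(statistics, fn {script, {first_index, count}} ->
      {first_index, -count, script}
    end)
  end
end

# Input shaped like the per-script statistics shown in the examples.
stats = %{common: {5, 5}, han: {24, 2}, latin: {0, 19}}
IO.inspect(DominanceSketch.order(stats))
```

Sorting a map with `Enum.sort_by/2` yields a list of `{script, {index, count}}` tuples, which is a keyword list of the shape documented in the Returns section.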

script_statistic(string)

(since 1.16.0)
@spec script_statistic(String.t()) :: %{
  required(script()) => {non_neg_integer(), pos_integer()}
}

Returns the first index and grapheme count of each script detected in a string.

Arguments

  • string is any String.t/0.

Returns

  • A map where the key is a script/0 and the value is a tuple where the first element is the index in the string where that script first appeared and the second element is the number of graphemes in that script.

Examples

iex> Unicode.script_statistic "Tokyo is the capital of 日本"
%{common: {5, 5}, han: {24, 2}, latin: {0, 19}}

iex> Unicode.script_statistic "おはよう"
%{hiragana: {0, 4}}
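
The shape of this result (first index plus grapheme count per script) can be reproduced with a single fold over the graphemes of a string. The sketch below is an assumption about how such a statistic could be computed, not the library's implementation. `ScriptStatisticSketch`, `statistic/2` and the toy classifier are hypothetical; a real classifier would consult the Unicode script data.

```elixir
defmodule ScriptStatisticSketch do
  # Folds over the graphemes, recording for each script the index of its
  # first grapheme and how many graphemes belong to it. `classify` is a
  # caller-supplied function from grapheme to script atom.
  def statistic(string, classify) when is_binary(string) do
    string
    |> String.graphemes()
    |> Enum.with_index()
    |> Enum.reduce(%{}, fn {grapheme, index}, acc ->
      Map.update(acc, classify.(grapheme), {index, 1}, fn {first, count} ->
        {first, count + 1}
      end)
    end)
  end
end

# Toy classifier: spaces are :common, ASCII letters are :latin and anything
# else is lumped into :other. This is a stand-in, not real script detection.
toy_classify = fn
  " " -> :common
  <<c>> when c in ?a..?z or c in ?A..?Z -> :latin
  _ -> :other
end

IO.inspect(ScriptStatisticSketch.statistic("Tokyo is the capital of 日本", toy_classify))
```

The index here counts graphemes from the start of the string, which is also an assumption about how the documented index is measured.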

Removes accents (diacritical marks) from a string.

Arguments

Returns

  • A string with all diacritical marks removed

Notes

The string is first normalised to :nfd form and then all characters in the block :combining_diacritical_marks are removed from the string.
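
The two steps in this note can be sketched with the standard library: decompose with `String.normalize/2`, then drop every codepoint in the Combining Diacritical Marks block (U+0300 to U+036F). `UnaccentSketch` is a hypothetical module for illustration, and the hard-coded range is an assumption about what the block lookup resolves to.

```elixir
defmodule UnaccentSketch do
  # Bounds of the Combining Diacritical Marks block (U+0300..U+036F).
  @first 0x0300
  @last 0x036F

  def unaccent(string) when is_binary(string) do
    string
    # Split accented characters into a base character plus combining marks.
    |> String.normalize(:nfd)
    |> String.to_charlist()
    # Drop the combining marks, keeping the base characters.
    |> Enum.reject(fn cp -> cp >= @first and cp <= @last end)
    |> List.to_string()
  end
end

IO.puts(UnaccentSketch.unaccent("Et Ça sera sa moitié."))
```

Marks that live outside this block, and letters with no canonical decomposition such as "ø", are left unchanged by this approach.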

Example

iex> Unicode.unaccent("Et Ça sera sa moitié.")
"Et Ca sera sa moitie."

upcase?(codepoint_or_string)

See Unicode.Property.uppercase?/1.

uppercase?(codepoint_or_string)

@spec uppercase?(codepoint_or_string()) :: boolean()

Returns true if a single Unicode codepoint (or all characters in the given string) adhere to Unicode category :Lu, otherwise returns false.

Notice that there are many languages that do not have a distinction between cases. Their characters are not included in this group.

Arguments

  • codepoint_or_string is a single integer codepoint or a String.t/0.

Returns

  • true or false

For the string-version, the result will be true only if all codepoints in the string adhere to the property.

Examples

iex> Unicode.uppercase?(?a)
false

iex> Unicode.uppercase?("A")
true

iex> Unicode.uppercase?("Elixir")
false

iex> Unicode.uppercase?("CAMEMBERT")
true

iex> Unicode.uppercase?("foo, bar")
false

iex> Unicode.uppercase?("42")
false

iex> Unicode.uppercase?("Σ")
true

iex> Unicode.uppercase?("σ")
false

Returns the version of Unicode in use.