Unicode (Unicode v1.20.0)
Functions to introspect the Unicode character database and to provide fast codepoint lookups for scripts, blocks, categories and properties.
Summary
Types
A codepoint is an integer representing a Unicode character
A codepoint or a string
Unicode UTF encodings
The valid scripts as of Unicode 15
Functions
Returns a list of tuples representing the full range of Unicode code points.
Returns true if a single Unicode codepoint (or all characters in the
given string) adhere to the Derived Core Property Alphabetic
otherwise returns false.
Returns true if a single Unicode codepoint (or all characters
in the given string) are either alphabetic?/1 or
numeric?/1 otherwise returns false.
Returns a list of tuples representing the assigned ranges of Unicode code points.
Returns the block name of a codepoint or the list of block names for each codepoint in a string.
Returns true if the codepoint (or every codepoint in the given string) has
the :cased property, otherwise returns false.
Returns the Unicode category for a codepoint or a list of categories for a string.
Returns true if a single Unicode codepoint (or all characters
in the given string) adhere to Unicode category :Nd
otherwise returns false.
Returns true if a single Unicode codepoint (or all characters
in the given string) are emoji otherwise returns false.
Returns true if a single Unicode codepoint (or all characters
in the given string) are in the category :Ll, otherwise returns false.
Returns true if a single Unicode codepoint (or all characters
in the given string) are in the category :Sm, otherwise returns false.
Returns true if a single Unicode codepoint (or all characters
in the given string) belong to one of the Unicode categories :Nd,
:Nl or :No, otherwise returns false.
Returns the list of properties of a given codepoint or, for a string, the list of properties of each codepoint in the string.
Returns a map of aliases mapping property names to a module that serves that property.
Ensures that a binary is valid UTF encoded.
Returns the script name of a codepoint or the list of script names for each codepoint in a string.
Returns a keyword list of scripts in descending dominance order for a given string.
Returns the first index and grapheme count of each script detected in a string.
Removes accents (diacritical marks) from a string.
Returns true if a single Unicode codepoint (or all characters
in the given string) are in the category :Lu, otherwise returns false.
Returns the version of Unicode in use.
Types
@type codepoint() :: non_neg_integer()
A codepoint is an integer representing a Unicode character
@type codepoint_or_string() :: codepoint() | String.t()
A codepoint or a string
@type encoding() ::
:utf8 | :utf16 | :utf16be | :utf16le | :utf32 | :utf32be | :utf32le
Unicode UTF encodings
@type script() ::
:tangsa
| :runic
| :greek
| :myanmar
| :cherokee
| :palmyrene
| :elymaic
| :latin
The valid scripts as of Unicode 15
Functions
@spec all() :: [{0, 1_114_111}]
Returns a list of tuples representing the full range of Unicode code points.
@spec alphabetic?(codepoint_or_string()) :: boolean()
Returns true if a single Unicode codepoint (or all characters in the
given string) adhere to the Derived Core Property Alphabetic
otherwise returns false.
These are all characters that are usually used as representations of letters or syllables in words and sentences.
Arguments
codepoint_or_string is a single integer codepoint or a String.t/0.
Returns
true or false
For the string-version, the result will be true only if all codepoints in the string adhere to the property.
Examples
iex> Unicode.alphabetic?(?a)
true
iex> Unicode.alphabetic?("A")
true
iex> Unicode.alphabetic?("Elixir")
true
iex> Unicode.alphabetic?("الإكسير")
true
# comma and whitespace
iex> Unicode.alphabetic?("foo, bar")
false
iex> Unicode.alphabetic?("42")
false
iex> Unicode.alphabetic?("龍王")
true
# Summation, ∑
iex> Unicode.alphabetic?("∑")
false
# Greek capital letter sigma, Σ
iex> Unicode.alphabetic?("Σ")
true
@spec alphanumeric?(codepoint_or_string()) :: boolean()
Returns true if a single Unicode codepoint (or all characters
in the given string) are either alphabetic?/1 or
numeric?/1 otherwise returns false.
Arguments
codepoint_or_string is a single integer codepoint or a String.t/0.
Returns
true or false
For the string-version, the result will be true only if all codepoints in the string adhere to the property.
Examples
iex> Unicode.alphanumeric? "1234"
true
iex> Unicode.alphanumeric? "KeyserSöze1995"
true
iex> Unicode.alphanumeric? "3段"
true
iex> Unicode.alphanumeric? "dragon@example.com"
false
@spec assigned() :: [{pos_integer(), pos_integer()}]
Returns a list of tuples representing the assigned ranges of Unicode code points.
This information is derived from the block
ranges as defined by Unicode.Block.blocks/0.
@spec block(codepoint_or_string()) :: atom() | [atom(), ...]
Returns the block name of a codepoint or the list of block names for each codepoint in a string.
Arguments
codepoint_or_string is a single integer codepoint or a String.t/0.
Returns
in the case of a single codepoint, an atom block name
in the case of a string, a list of atom block names, one for each codepoint in the codepoint_or_string
Examples
iex> Unicode.block ?ä
:latin_1_supplement
iex> Unicode.block ?A
:basic_latin
iex> Unicode.block "äA"
[:latin_1_supplement, :basic_latin]
@spec cased?(codepoint_or_string()) :: boolean()
Returns true if the codepoint (or every codepoint in the given string) has
the :cased property, otherwise returns false.
The :cased property means that this character has at least
an upper and lower representation and possibly a titlecase
representation too.
Arguments
codepoint_or_string is a single integer codepoint or a String.t/0.
Returns
true or false
For the string-version, the result will be true only if all codepoints in the string adhere to the property.
Examples
iex> Unicode.cased? ?ယ
false
iex> Unicode.cased? ?A
true
@spec category(codepoint_or_string()) :: atom() | [atom(), ...]
Returns the Unicode category for a codepoint or a list of categories for a string.
Arguments
codepoint_or_string is a single integer codepoint or a String.t/0.
Returns
in the case of a single codepoint, an atom representing one of the categories listed below
in the case of a string, a list representing the category for each codepoint in the string
Notes
These categories match the names of the Unicode character classes used in various regular expression engines and in Unicode Sets. The full list of categories is:
| Category | Matches |
|---|---|
| :C | Other |
| :Cc | Control |
| :Cf | Format |
| :Cn | Unassigned |
| :Co | Private use |
| :Cs | Surrogate |
| :L | Letter |
| :Ll | Lower case letter |
| :Lm | Modifier letter |
| :Lo | Other letter |
| :Lt | Title case letter |
| :Lu | Upper case letter |
| :M | Mark |
| :Mc | Spacing mark |
| :Me | Enclosing mark |
| :Mn | Non-spacing mark |
| :N | Number |
| :Nd | Decimal number |
| :Nl | Letter number |
| :No | Other number |
| :P | Punctuation |
| :Pc | Connector punctuation |
| :Pd | Dash punctuation |
| :Pe | Close punctuation |
| :Pf | Final punctuation |
| :Pi | Initial punctuation |
| :Po | Other punctuation |
| :Ps | Open punctuation |
| :S | Symbol |
| :Sc | Currency symbol |
| :Sk | Modifier symbol |
| :Sm | Mathematical symbol |
| :So | Other symbol |
| :Z | Separator |
| :Zl | Line separator |
| :Zp | Paragraph separator |
| :Zs | Space separator |
Note too that the group level categories like :L,
:M, :S and so on are not assigned to any codepoint.
They can only be identified by combining the results
for each of the subsidiary categories.
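Because the group-level categories are never returned directly, a test for group membership has to enumerate the subcategories. The following is a minimal sketch for the :L (Letter) group, using only Unicode.category/1 and the subcategory names from the table above. The `letter?` function is a hypothetical helper for illustration and is not part of the library.

```elixir
# Hypothetical helper, not part of the Unicode library.
# Subcategories of the group-level category :L, taken from the table above.
letter_categories = [:Ll, :Lm, :Lo, :Lt, :Lu]

letter? = fn codepoint ->
  Unicode.category(codepoint) in letter_categories
end

letter?.(?ä) # => true, because the category is :Ll
letter?.(?1) # => false, because the category is :Nd
```

The same pattern should apply to the other groups (:M, :N, :P, :S, :Z and :C) by swapping in their subcategories.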
Examples
iex> Unicode.category ?ä
:Ll
iex> Unicode.category ?A
:Lu
iex> Unicode.category ?🧐
:So
iex> Unicode.category ?+
:Sm
iex> Unicode.category ?1
:Nd
iex> Unicode.category "äA"
[:Ll, :Lu]
@spec digits?(codepoint_or_string()) :: boolean()
Returns true if a single Unicode codepoint (or all characters
in the given string) adhere to Unicode category :Nd
otherwise returns false.
This group of characters represents the decimal digits zero through nine (0..9) and the equivalents in non-Latin scripts.
Arguments
codepoint_or_string is a single integer codepoint or a String.t/0.
Returns
true or false
For the string-version, the result will be true only if all codepoints in the string adhere to the property.
Examples
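No examples are listed for this function in the source. The calls below are a minimal sketch of the behaviour implied by the description above, assuming membership in category :Nd is the only criterion. The inputs are illustrative and are not taken from the library's own documentation.

```elixir
iex> Unicode.digits?("1234")
true
# Arabic-Indic digits are also in category :Nd
iex> Unicode.digits?("٠١٢٣")
true
# A single non-digit makes the whole string fail
iex> Unicode.digits?("12a")
false
```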
@spec emoji?(codepoint_or_string()) :: boolean()
Returns true if a single Unicode codepoint (or all characters
in the given string) are emoji otherwise returns false.
Arguments
codepoint_or_string is a single integer codepoint or a String.t/0.
Returns
true or false
For the string-version, the result will be true only if all codepoints in the string adhere to the property.
Examples
iex> Unicode.emoji? "🧐🤓🤩🤩️🤯"
true
@spec lowercase?(codepoint_or_string()) :: boolean()
Returns true if a single Unicode codepoint (or all characters
in the given string) are in the category :Ll, otherwise returns false.
Notice that there are many languages that do not have a distinction between cases. Their characters are not included in this group.
Arguments
codepoint_or_string is a single integer codepoint or a String.t/0.
Returns
true or false
For the string-version, the result will be true only if all codepoints in the string adhere to the property.
Examples
iex> Unicode.lowercase?(?a)
true
iex> Unicode.lowercase?("A")
false
iex> Unicode.lowercase?("Elixir")
false
iex> Unicode.lowercase?("léon")
true
iex> Unicode.lowercase?("foo, bar")
false
iex> Unicode.lowercase?("42")
false
iex> Unicode.lowercase?("Σ")
false
iex> Unicode.lowercase?("σ")
true
@spec math?(codepoint_or_string()) :: boolean()
Returns true if a single Unicode codepoint (or all characters
in the given string) are in the category :Sm, otherwise returns false.
These are all characters whose primary usage is in mathematical concepts (and not in alphabets). Notice that the numerical digits are not part of this group.
Arguments
codepoint_or_string is a single integer codepoint or a String.t/0.
Returns
true or false
For the string-version, the result will be true only if all codepoints in the string adhere to the property.
Examples
iex> Unicode.math?(?=)
true
iex> Unicode.math?("=")
true
iex> Unicode.math?("1+1=2") # Digits do not have the `:math` property.
false
iex> Unicode.math?("परिस")
false
iex> Unicode.math?("∑") # Summation, \u2211
true
iex> Unicode.math?("Σ") # Greek capital letter sigma, \u03a3
false
@spec numeric?(codepoint_or_string()) :: boolean()
Returns true if a single Unicode codepoint (or all characters
in the given string) belong to one of the Unicode categories :Nd,
:Nl or :No, otherwise returns false.
This group covers the decimal digits zero through nine (0..9) and their equivalents in non-Latin scripts (:Nd), together with letter numbers (:Nl) and other numbers (:No).
Arguments
codepoint_or_string is a single integer codepoint or a String.t/0.
Returns
true or false
For the string-version, the result will be true only if all codepoints in the string adhere to the property.
Examples
iex> Unicode.numeric?("65535")
true
iex> Unicode.numeric?("42")
true
iex> Unicode.numeric?("lapis philosophorum")
false
@spec properties(codepoint_or_string()) :: [atom(), ...] | [[atom(), ...], ...]
Returns the list of properties of a given codepoint or, for a string, the list of properties of each codepoint in the string.
Arguments
codepoint_or_string is a single integer codepoint or a String.t/0.
Returns
in the case of a single codepoint, a list of property atoms.
in the case of a string, one list of property atoms for each codepoint in the codepoint_or_string.
Examples
iex> Unicode.properties 0x1bf0
[
:alphabetic,
:case_ignorable,
:grapheme_extend,
:id_continue,
:incb,
:other_alphabetic,
:xid_continue
]
iex> Unicode.properties ?A
[
:alphabetic,
:ascii_hex_digit,
:cased,
:changes_when_casefolded,
:changes_when_casemapped,
:changes_when_lowercased,
:grapheme_base,
:hex_digit,
:id_continue,
:id_start,
:uppercase,
:xid_continue,
:xid_start
]
iex> Unicode.properties ?+
[:grapheme_base, :math, :pattern_syntax]
iex> Unicode.properties "a1+"
[
[
:alphabetic,
:ascii_hex_digit,
:cased,
:changes_when_casemapped,
:changes_when_titlecased,
:changes_when_uppercased,
:grapheme_base,
:hex_digit,
:id_continue,
:id_start,
:lowercase,
:xid_continue,
:xid_start
],
[
:ascii_hex_digit,
:emoji,
:emoji_component,
:grapheme_base,
:hex_digit,
:id_continue,
:xid_continue
],
[
:grapheme_base,
:math,
:pattern_syntax
]
]
Returns a map of aliases mapping property names to a module that serves that property.
replace_invalid(binary, encoding \\ :utf8, replacement \\ "�")
(since 1.18.0)
@spec replace_invalid(binary :: binary(), encoding :: encoding(), replacement :: String.t()) :: binary()
Ensures that a binary is valid UTF encoded.
The string is validated by replacing any invalid UTF bytes or incomplete sequences with a replacement string.
Arguments
- binary is any sequence of bytes.
- encoding is any UTF encoding, being one of :utf8, :utf16, :utf16be, :utf16le, :utf32, :utf32be or :utf32le. The default is :utf8.
- replacement is any string that will be used to replace invalid UTF-8 bytes or incomplete sequences. The default is "�".
Returns
- A valid UTF binary that may or may not include replacements for invalid UTF. If encoding is :utf8 then the return type is a String.t/0.
Notes
Unicode.replace_invalid(string, :utf8) will delegate to String.replace_invalid/2 where available, which is from Elixir 1.16 onwards.
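The third argument can be used to substitute something other than the default "�". The call below is a small sketch of that, based on the signature and argument descriptions above. The "?" replacement and the input bytes are illustrative choices only.

```elixir
# 0b11111111 (0xFF) is never valid in UTF-8, so it is replaced
iex> Unicode.replace_invalid(<<"foo", 0b11111111, "bar">>, :utf8, "?")
"foo?bar"
```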
Example
iex> Unicode.replace_invalid(<<"foo", 0b11111111, "bar">>, :utf8)
"foo�bar"
@spec script(codepoint_or_string()) :: String.t() | [String.t(), ...]
Returns the script name of a codepoint or the list of script names for each codepoint in a string.
Arguments
codepoint_or_string is a single integer codepoint or a String.t/0.
Returns
in the case of a single codepoint, an atom script name
in the case of a string, a list of atom script names, one for each codepoint in the codepoint_or_string
Examples
iex> Unicode.script ?ä
:latin
iex> Unicode.script ?خ
:arabic
iex> Unicode.script ?अ
:devanagari
iex> Unicode.script ?א
:hebrew
iex> Unicode.script ?Ж
:cyrillic
iex> Unicode.script ?δ
:greek
iex> Unicode.script ?ก
:thai
iex> Unicode.script ?ယ
:myanmar
@spec script_dominance(String.t()) :: [{script(), {non_neg_integer(), pos_integer()}}]
Returns a keyword list of scripts in descending dominance order for a given string.
Dominance is determined by (in order of priority):
- Index of the first occurrence of the script
- Count of the number of graphemes in the script
- Lexical ordering of the script name (used as a final means to ensure returning a deterministic result).
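The three criteria above can be sketched as a single sort over per-script statistics. This is a minimal illustration, not the library's implementation. It starts from the documented output of Unicode.script_statistic/1 for "Tokyo is the capital of 日本" and assumes that an earlier first index ranks higher, that a larger grapheme count ranks higher on a tie, and that script names are compared in ascending order last. The tie-break directions are assumptions.

```elixir
# Statistics in the shape returned by Unicode.script_statistic/1,
# copied from the documented example for "Tokyo is the capital of 日本".
statistics = %{common: {5, 5}, han: {24, 2}, latin: {0, 19}}

# Assumed ordering: earliest first index, then larger grapheme count,
# then script name. Negating the count sorts larger counts first.
dominance =
  Enum.sort_by(statistics, fn {script, {index, count}} ->
    {index, -count, script}
  end)

# dominance == [latin: {0, 19}, common: {5, 5}, han: {24, 2}]
```

Sorting on a tuple key keeps the priority order explicit: later elements are only compared when the earlier ones are equal.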
Arguments
string is any String.t/0.
Returns
- A keyword list where the key is a script/0 and the value is a tuple where the first element is the index in the string where that script first appeared and the second element is the number of graphemes in that script. The list is ordered by descending dominance.
Example
iex> Unicode.script_dominance "Tokyo is the capital of 日本"
[latin: {0, 19}, common: {5, 5}, han: {24, 2}]
iex> Unicode.script_dominance "おはよう"
[hiragana: {0, 4}]
@spec script_statistic(String.t()) :: %{ required(script()) => {non_neg_integer(), pos_integer()} }
Returns the first index and grapheme count of each script detected in a string.
Arguments
string is any String.t/0.
Returns
- A map where the key is a script/0 and the value is a tuple where the first element is the index in the string where that script first appeared and the second element is the number of graphemes in that script.
Examples
iex> Unicode.script_statistic "Tokyo is the capital of 日本"
%{common: {5, 5}, han: {24, 2}, latin: {0, 19}}
iex> Unicode.script_statistic "おはよう"
%{hiragana: {0, 4}}
Removes accents (diacritical marks) from a string.
Arguments
string is any String.t/0.
Returns
- A string with all diacritical marks removed
Notes
The string is first normalised to :nfd form
and then all characters in the block
:combining_diacritical_marks are removed
from the string.
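The two steps described in these notes can be sketched with the Elixir standard library alone. This is an illustration of the technique, not the library's implementation. The `UnaccentSketch` module is hypothetical, and the range U+0300..U+036F is the Combining Diacritical Marks block as defined by Unicode.

```elixir
defmodule UnaccentSketch do
  # Hypothetical module for illustration; not the library's implementation.
  # Codepoint range of the Combining Diacritical Marks block.
  @combining_diacritical_marks 0x0300..0x036F

  def unaccent(string) when is_binary(string) do
    string
    # Decompose, so "é" becomes "e" followed by U+0301
    |> String.normalize(:nfd)
    |> String.to_charlist()
    # Drop every codepoint that falls inside the combining marks block
    |> Enum.reject(&(&1 in @combining_diacritical_marks))
    |> List.to_string()
  end
end

UnaccentSketch.unaccent("Et Ça sera sa moitié.")
# => "Et Ca sera sa moitie."
```

Marks outside that single block, such as those in the extended or supplementary combining blocks, are left in place by this sketch.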
Example
iex> Unicode.unaccent("Et Ça sera sa moitié.")
"Et Ca sera sa moitie."
@spec uppercase?(codepoint_or_string()) :: boolean()
Returns true if a single Unicode codepoint (or all characters
in the given string) are in the category :Lu, otherwise returns false.
Notice that there are many languages that do not have a distinction between cases. Their characters are not included in this group.
Arguments
codepoint_or_string is a single integer codepoint or a String.t/0.
Returns
true or false
For the string-version, the result will be true only if all codepoints in the string adhere to the property.
Examples
iex> Unicode.uppercase?(?a)
false
iex> Unicode.uppercase?("A")
true
iex> Unicode.uppercase?("Elixir")
false
iex> Unicode.uppercase?("CAMEMBERT")
true
iex> Unicode.uppercase?("foo, bar")
false
iex> Unicode.uppercase?("42")
false
iex> Unicode.uppercase?("Σ")
true
iex> Unicode.uppercase?("σ")
false
Returns the version of Unicode in use.