# `Unicode`
[🔗](https://github.com/elixir-unicode/unicode/blob/v1.21.1/lib/unicode.ex#L1)

Functions to introspect the Unicode character database and
to provide fast codepoint lookups for scripts, blocks,
categories and properties.

# `codepoint`

```elixir
@type codepoint() :: non_neg_integer()
```

A codepoint is an integer representing a Unicode character

# `codepoint_or_string`

```elixir
@type codepoint_or_string() :: codepoint() | String.t()
```

A codepoint or a string

# `encoding`

```elixir
@type encoding() ::
  :utf8 | :utf16 | :utf16be | :utf16le | :utf32 | :utf32be | :utf32le
```

Unicode UTF encodings

# `script`

```elixir
@type script() ::
  :tangsa
  | :runic
  | :greek
  | :myanmar
  | :cherokee
  | :palmyrene
  | :elymaic
  | :latin
```

The valid scripts as of Unicode 15

# `all`

```elixir
@spec all() :: [{0, 1_114_111}]
```

Returns a list of tuples representing the
full range of Unicode code points.

# `alphabetic?`

```elixir
@spec alphabetic?(codepoint_or_string()) :: boolean()
```

Returns `true` if a single Unicode codepoint (or all characters in the
given string) adhere to the Derived Core Property `Alphabetic`
otherwise returns `false`.

These are all characters that are usually used as representations
of letters/syllables in words/sentences.

## Arguments

* `codepoint_or_string` is a single integer codepoint
  or a `t:String.t/0`.

## Returns

* `true` or `false`

For the string-version, the result will be true only if _all_
codepoints in the string adhere to the property.
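
The "all codepoints" rule above can be sketched with a stand-in predicate. This is a minimal illustration that assumes nothing about the library's internals: `AllCodepointsSketch` and `ascii_letter?/1` are hypothetical names, and the ASCII-only check merely stands in for a real Unicode property lookup.

```elixir
defmodule AllCodepointsSketch do
  # Hypothetical stand-in predicate: true only for ASCII letters.
  def ascii_letter?(codepoint) when is_integer(codepoint) do
    codepoint in ?a..?z or codepoint in ?A..?Z
  end

  # A string satisfies the predicate only if every codepoint in it does.
  def ascii_letter?(string) when is_binary(string) do
    string
    |> String.to_charlist()
    |> Enum.all?(&ascii_letter?/1)
  end
end

AllCodepointsSketch.ascii_letter?(?a)         # => true
AllCodepointsSketch.ascii_letter?("Elixir")   # => true
AllCodepointsSketch.ascii_letter?("foo, bar") # => false
```

Note that `Enum.all?/2` returns `true` for an empty list, so this sketch treats the empty string as satisfying the predicate; whether the library does the same is not stated here.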

## Examples

    iex> Unicode.alphabetic?(?a)
    true

    iex> Unicode.alphabetic?("A")
    true

    iex> Unicode.alphabetic?("Elixir")
    true

    iex> Unicode.alphabetic?("الإكسير")
    true

    # comma and whitespace
    iex> Unicode.alphabetic?("foo, bar")
    false

    iex> Unicode.alphabetic?("42")
    false

    iex> Unicode.alphabetic?("龍王")
    true

    # Summation, ∑
    iex> Unicode.alphabetic?("∑")
    false

    # Greek capital letter sigma, Σ
    iex> Unicode.alphabetic?("Σ")
    true

# `alphanumeric?`

```elixir
@spec alphanumeric?(codepoint_or_string()) :: boolean()
```

Returns `true` if a single Unicode codepoint (or all characters
in the given string) are either `alphabetic?/1` or
`numeric?/1` otherwise returns `false`.

## Arguments

* `codepoint_or_string` is a single integer codepoint
  or a `t:String.t/0`.

## Returns

* `true` or `false`

For the string-version, the result will be true only if _all_
codepoints in the string adhere to the property.

## Examples

    iex> Unicode.alphanumeric? "1234"
    true

    iex> Unicode.alphanumeric? "KeyserSöze1995"
    true

    iex> Unicode.alphanumeric? "3段"
    true

    iex> Unicode.alphanumeric? "dragon@example.com"
    false

# `assigned`

```elixir
@spec assigned() :: [{pos_integer(), pos_integer()}]
```

Returns a list of tuples representing the
assigned ranges of Unicode code points.

This information is derived from the block
ranges as defined by `Unicode.Block.blocks/0`.
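
A list of `{first, last}` tuples like the one returned here lends itself to a simple membership test. The following is a sketch only: `RangeLookupSketch`, `in_ranges?/2` and `sample_ranges` are hypothetical, and the sample holds just three block ranges rather than the full list.

```elixir
defmodule RangeLookupSketch do
  # Returns true if `codepoint` falls inside any `{first, last}` tuple.
  def in_ranges?(codepoint, ranges) when is_integer(codepoint) do
    Enum.any?(ranges, fn {first, last} ->
      codepoint >= first and codepoint <= last
    end)
  end
end

# Hypothetical sample; the real list would come from Unicode.assigned/0.
sample_ranges = [{0x0000, 0x007F}, {0x0080, 0x00FF}, {0x0370, 0x03FF}]

RangeLookupSketch.in_ranges?(?A, sample_ranges)     # => true
RangeLookupSketch.in_ranges?(0x0100, sample_ranges) # => false
```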

# `block`

```elixir
@spec block(codepoint_or_string()) :: atom() | [atom(), ...]
```

Returns the block name of a codepoint
or the list of block names for each codepoint
in a string.

## Arguments

* `codepoint_or_string` is a single integer codepoint
  or a `t:String.t/0`.

## Returns

* in the case of a single codepoint, an atom
  block name

* in the case of a string, a list of atom
  block names for each codepoint in the
  `codepoint_or_string`

## Examples

    iex> Unicode.block ?ä
    :latin_1_supplement

    iex> Unicode.block ?A
    :basic_latin

    iex> Unicode.block "äA"
    [:latin_1_supplement, :basic_latin]

# `cased?`

```elixir
@spec cased?(codepoint_or_string()) :: boolean()
```

Returns `true` if the codepoint (or every codepoint in the given
string) has the `:cased` property, otherwise returns `false`.

The `:cased` property means that this character has at least
an upper and lower case representation and possibly a titlecase
representation too.

## Arguments

* `codepoint_or_string` is a single integer codepoint
  or a `t:String.t/0`.

## Returns

* `true` or `false`

For the string-version, the result will be true only if _all_
codepoints in the string adhere to the property.

## Examples

    iex> Unicode.cased? ?ယ
    false

    iex> Unicode.cased? ?A
    true

# `category`

```elixir
@spec category(codepoint_or_string()) :: atom() | [atom(), ...]
```

Returns the Unicode category for a codepoint or a list of
categories for a string.

## Arguments

* `codepoint_or_string` is a single integer codepoint
  or a `t:String.t/0`.

## Returns

* in the case of a single codepoint, an atom representing
  one of the categories listed below

* in the case of a string, a list representing the
  category for each codepoint in the string

## Notes

These categories match the names of the Unicode character
classes used in various regular expression engines and in
Unicode Sets. The full list of categories is:

| Category | Matches               |
| -------- | --------------------- |
| :C       | Other                 |
| :Cc      | Control               |
| :Cf      | Format                |
| :Cn      | Unassigned            |
| :Co      | Private use           |
| :Cs      | Surrogate             |
| :L       | Letter                |
| :Ll      | Lower case letter     |
| :Lm      | Modifier letter       |
| :Lo      | Other letter          |
| :Lt      | Title case letter     |
| :Lu      | Upper case letter     |
| :M       | Mark                  |
| :Mc      | Spacing mark          |
| :Me      | Enclosing mark        |
| :Mn      | Non-spacing mark      |
| :N       | Number                |
| :Nd      | Decimal number        |
| :Nl      | Letter number         |
| :No      | Other number          |
| :P       | Punctuation           |
| :Pc      | Connector punctuation |
| :Pd      | Dash punctuation      |
| :Pe      | Close punctuation     |
| :Pf      | Final punctuation     |
| :Pi      | Initial punctuation   |
| :Po      | Other punctuation     |
| :Ps      | Open punctuation      |
| :S       | Symbol                |
| :Sc      | Currency symbol       |
| :Sk      | Modifier symbol       |
| :Sm      | Mathematical symbol   |
| :So      | Other symbol          |
| :Z       | Separator             |
| :Zl      | Line separator        |
| :Zp      | Paragraph separator   |
| :Zs      | Space separator       |

Note too that the group level categories like `:L`,
`:M`, `:S` and so on are not assigned to any codepoint.
They can only be identified by combining the results
for each of the subsidiary categories.
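
One way to recover a group-level category from a subsidiary one is to take the first letter of its name. The sketch below assumes only the naming scheme shown in the table; `CategoryGroupSketch` and `group/1` are hypothetical helpers, not part of the library.

```elixir
defmodule CategoryGroupSketch do
  # Maps a two-letter category such as :Lu to its group-level
  # category (:L) by keeping only the first letter of the name.
  def group(category) when is_atom(category) do
    category
    |> Atom.to_string()
    |> String.first()
    |> String.to_atom()
  end

  # True if the category belongs to the Letter group.
  def letter?(category), do: group(category) == :L
end

CategoryGroupSketch.group(:Lu)   # => :L
CategoryGroupSketch.group(:Sm)   # => :S
CategoryGroupSketch.letter?(:Nd) # => false
```

The atom returned by a call such as `Unicode.category(?A)` could be passed straight to `group/1`.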

## Examples

    iex> Unicode.category ?ä
    :Ll

    iex> Unicode.category ?A
    :Lu

    iex> Unicode.category ?🧐
    :So

    iex> Unicode.category ?+
    :Sm

    iex> Unicode.category ?1
    :Nd

    iex> Unicode.category "äA"
    [:Ll, :Lu]

# `digits?`

```elixir
@spec digits?(codepoint_or_string()) :: boolean()
```

Returns `true` if a single Unicode codepoint (or all characters
in the given string) adhere to Unicode category `:Nd`
otherwise returns `false`.

This group of characters represents the decimal digits zero
through nine (0..9) and the equivalents in non-Latin scripts.

## Arguments

* `codepoint_or_string` is a single integer codepoint
  or a `t:String.t/0`.

## Returns

* `true` or `false`

For the string-version, the result will be true only if _all_
codepoints in the string adhere to the property.

## Examples
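
As a rough illustration of what membership of category `:Nd` means, the sketch below uses only Elixir's standard regex engine rather than this library. `DigitsSketch` is a hypothetical module, and its handling of edge cases (it returns `false` for the empty string) is an assumption that may differ from `Unicode.digits?/1`.

```elixir
defmodule DigitsSketch do
  # \p{Nd} matches the Unicode "Decimal number" category;
  # the `u` modifier enables Unicode property matching.
  def digits?(string) when is_binary(string) do
    String.match?(string, ~r/\A\p{Nd}+\z/u)
  end
end

DigitsSketch.digits?("65535") # => true
DigitsSketch.digits?("٤٢")    # => true  (Arabic-Indic digits are :Nd)
DigitsSketch.digits?("4²")    # => false (superscript two is :No)
DigitsSketch.digits?("3段")   # => false
```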

# `downcase?`

# `emoji?`

```elixir
@spec emoji?(codepoint_or_string()) :: boolean()
```

Returns `true` if a single Unicode codepoint (or all characters
in the given string) are `emoji` otherwise returns `false`.

## Arguments

* `codepoint_or_string` is a single integer codepoint
  or a `t:String.t/0`.

## Returns

* `true` or `false`

For the string-version, the result will be true only if _all_
codepoints in the string adhere to the property.

## Examples

    iex> Unicode.emoji? "🧐🤓🤩🤩️🤯"
    true

# `lowercase?`

```elixir
@spec lowercase?(codepoint_or_string()) :: boolean()
```

Returns `true` if a single Unicode codepoint (or all characters
in the given string) adhere to the category `:Ll`, otherwise returns `false`.

Notice that there are many languages that do not have a distinction
between cases. Their characters are not included in this group.

## Arguments

* `codepoint_or_string` is a single integer codepoint
  or a `t:String.t/0`.

## Returns

* `true` or `false`

For the string-version, the result will be true only if _all_
codepoints in the string adhere to the property.

## Examples

    iex> Unicode.lowercase?(?a)
    true

    iex> Unicode.lowercase?("A")
    false

    iex> Unicode.lowercase?("Elixir")
    false

    iex> Unicode.lowercase?("léon")
    true

    iex> Unicode.lowercase?("foo, bar")
    false

    iex> Unicode.lowercase?("42")
    false

    iex> Unicode.lowercase?("Σ")
    false

    iex> Unicode.lowercase?("σ")
    true

# `math?`

```elixir
@spec math?(codepoint_or_string()) :: boolean()
```

Returns `true` if a single Unicode codepoint (or all characters
in the given string) adhere to the category `:Sm`, otherwise returns `false`.

These are all characters whose primary usage is in mathematical
concepts (and not in alphabets). Notice that the numerical digits
are not part of this group.

## Arguments

* `codepoint_or_string` is a single integer codepoint
  or a `t:String.t/0`.

## Returns

* `true` or `false`

For the string-version, the result will be true only if _all_
codepoints in the string adhere to the property.

## Examples

    iex> Unicode.math?(?=)
    true

    iex> Unicode.math?("=")
    true

    iex> Unicode.math?("1+1=2") # Digits do not have the `:math` property.
    false

    iex> Unicode.math?("परिस")
    false

    iex> Unicode.math?("∑") # Summation, \u2211
    true

    iex> Unicode.math?("Σ") # Greek capital letter sigma, \u03a3
    false

# `numeric?`

```elixir
@spec numeric?(codepoint_or_string()) :: boolean()
```

Returns `true` if a single Unicode codepoint (or all characters
in the given string) adhere to one of the Unicode categories `:Nd`,
`:Nl` or `:No`, otherwise returns `false`.

This group of characters includes the decimal digits zero
through nine (0..9) and their equivalents in non-Latin scripts (`:Nd`),
letter numbers such as Roman numerals (`:Nl`) and other numbers
such as fractions and superscript digits (`:No`).

## Arguments

* `codepoint_or_string` is a single integer codepoint
  or a `t:String.t/0`.

## Returns

* `true` or `false`

For the string-version, the result will be true only if _all_
codepoints in the string adhere to the property.

## Examples

    iex> Unicode.numeric?("65535")
    true

    iex> Unicode.numeric?("42")
    true

    iex> Unicode.numeric?("lapis philosophorum")
    false

# `properties`

```elixir
@spec properties(codepoint_or_string()) :: [atom(), ...] | [[atom(), ...], ...]
```

Returns the list of properties of a given codepoint
or, for a string, the list of properties of each
codepoint in the string.

## Arguments

* `codepoint_or_string` is a single integer codepoint
  or a `t:String.t/0`.

## Returns

* in the case of a single codepoint, a list of
  property atoms.

* in the case of a string, one list of property atoms for each
  codepoint in the `codepoint_or_string`.

## Examples

    iex> Unicode.properties 0x1bf0
    [
      :alphabetic,
      :case_ignorable,
      :grapheme_extend,
      :id_continue,
      :incb,
      :other_alphabetic,
      :xid_continue
    ]

    iex> Unicode.properties ?A
    [
      :alphabetic,
      :ascii_hex_digit,
      :cased,
      :changes_when_casefolded,
      :changes_when_casemapped,
      :changes_when_lowercased,
      :grapheme_base,
      :hex_digit,
      :id_continue,
      :id_start,
      :uppercase,
      :xid_continue,
      :xid_start
    ]

    iex> Unicode.properties ?+
    [:grapheme_base, :math, :pattern_syntax]

    iex> Unicode.properties "a1+"
    [
      [
        :alphabetic,
        :ascii_hex_digit,
        :cased,
        :changes_when_casemapped,
        :changes_when_titlecased,
        :changes_when_uppercased,
        :grapheme_base,
        :hex_digit,
        :id_continue,
        :id_start,
        :lowercase,
        :xid_continue,
        :xid_start
      ],
      [
        :ascii_hex_digit,
        :emoji,
        :emoji_component,
        :grapheme_base,
        :hex_digit,
        :id_continue,
        :xid_continue
      ],
      [
        :grapheme_base,
        :math,
        :pattern_syntax
      ]
    ]

# `property_servers`

Returns a map of aliases mapping
property names to a module that
serves that property.

# `ranges`

> This function is deprecated. Use Unicode.assigned/0.

# `replace_invalid`
*since 1.18.0* 

```elixir
@spec replace_invalid(
  binary :: binary(),
  encoding :: encoding(),
  replacement :: String.t()
) :: binary()
```

Ensures that a binary is valid UTF encoded.

The string is validated by replacing any invalid UTF
bytes or incomplete sequences with a replacement string.

## Arguments

* `binary` is any sequence of bytes.

* `encoding` is any UTF encoding being one of
  `:utf8`, `:utf16`, `:utf16be`, `:utf16le`, `:utf32`, `:utf32be` or
  `:utf32le`. The default is `:utf8`.

* `replacement` is any string that will be used to replace
  invalid UTF bytes or incomplete sequences. The default
  is `"�"`.

## Returns

* A valid UTF binary that may or may not include
  replacements for invalid UTF. If `encoding` is `:utf8`
  then the return type is a `t:String.t/0`.

## Notes

* `Unicode.replace_invalid(string, :utf8)` will delegate to
  `String.replace_invalid/2` where available, which is from
  Elixir 1.16 onwards.

## Example

    iex> Unicode.replace_invalid(<<"foo", 0b11111111, "bar">>, :utf8)
    "foo�bar"
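
For the `:utf8` case, the standard-library function mentioned in the notes can be called directly. The sketch below uses only `String.valid?/1` and `String.replace_invalid/2` and so requires Elixir 1.16 or later; the variable name `invalid` is illustrative.

```elixir
# 0xFF can never appear in well-formed UTF-8.
invalid = <<"foo", 0xFF, "bar">>

String.valid?(invalid)               # => false
String.replace_invalid(invalid)      # => "foo�bar"
String.replace_invalid(invalid, "?") # => "foo?bar"
```

The second argument plays the same role as `replacement` above.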

# `script`

```elixir
@spec script(codepoint_or_string()) :: String.t() | [String.t(), ...]
```

Returns the script name of a codepoint
or the list of script names for each codepoint
in a string.

## Arguments

* `codepoint_or_string` is a single integer codepoint
  or a `t:String.t/0`.

## Returns

* in the case of a single codepoint, a script
  name (an atom such as `:latin`)

* in the case of a string, a list of script
  names, one for each codepoint in the
  `codepoint_or_string`

## Examples

    iex> Unicode.script ?ä
    :latin

    iex> Unicode.script ?خ
    :arabic

    iex> Unicode.script ?अ
    :devanagari

    iex> Unicode.script ?א
    :hebrew

    iex> Unicode.script ?Ж
    :cyrillic

    iex> Unicode.script ?δ
    :greek

    iex> Unicode.script ?ก
    :thai

    iex> Unicode.script ?ယ
    :myanmar

# `script_dominance`
*since 1.16.0* 

```elixir
@spec script_dominance(String.t()) :: [{script(), {non_neg_integer(), pos_integer()}}]
```

Returns a keyword list of scripts in descending dominance
order for a given string.

Dominance is determined by (in order of priority):

* Index of the first occurrence of the script
* Count of the number of graphemes in the script
* Lexical ordering of the script name (used as a final means
  to ensure returning a deterministic result).

## Arguments

* `string` is any `t:String.t/0`.

## Returns

* A keyword list where the key is a `t:script/0` and the value
  is a tuple where the first element is the index in the
  string where that script first appeared and the second
  element is the number of graphemes in that script. The list
  is ordered by descending dominance.

## Example

    iex> Unicode.script_dominance "Tokyo is the capital of 日本"
    [latin: {0, 19}, common: {5, 5}, han: {24, 2}]

    iex> Unicode.script_dominance "おはよう"
    [hiragana: {0, 4}]
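
The ordering rules above can be sketched as a sort over a map of `{first_index, count}` tuples. This is not the library's implementation: `DominanceSketch` is hypothetical, and the sort directions (lower first index wins, then higher grapheme count, then script name) are assumptions chosen to be consistent with the first example.

```elixir
defmodule DominanceSketch do
  # Orders {script, {first_index, count}} pairs by ascending first
  # index, then descending count, then script name as a tie-breaker.
  def order(statistic) when is_map(statistic) do
    Enum.sort_by(statistic, fn {script, {first_index, count}} ->
      {first_index, -count, script}
    end)
  end
end

DominanceSketch.order(%{common: {5, 5}, han: {24, 2}, latin: {0, 19}})
# => [latin: {0, 19}, common: {5, 5}, han: {24, 2}]
```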

# `script_statistic`
*since 1.16.0* 

```elixir
@spec script_statistic(String.t()) :: %{
  required(script()) => {non_neg_integer(), pos_integer()}
}
```

Returns the first index and grapheme count of each
script detected in a string.

## Arguments

* `string` is any `t:String.t/0`.

## Returns

* A map where the key is a `t:script/0` and the value
  is a tuple where the first element is the index in the
  string where that script first appeared and the second
  element is the number of graphemes in that script.

## Examples

    iex> Unicode.script_statistic "Tokyo is the capital of 日本"
    %{common: {5, 5}, han: {24, 2}, latin: {0, 19}}

    iex> Unicode.script_statistic "おはよう"
    %{hiragana: {0, 4}}
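
The shape of this result can be illustrated with a small fold. The sketch assumes the script of each grapheme is already known and supplied as a list of atoms in string order; `StatisticSketch` and `from_scripts/1` are hypothetical and do not perform any script detection themselves.

```elixir
defmodule StatisticSketch do
  # `scripts` holds one script atom per grapheme, in string order.
  # Records the first index and the running count for each script.
  def from_scripts(scripts) when is_list(scripts) do
    scripts
    |> Enum.with_index()
    |> Enum.reduce(%{}, fn {script, index}, acc ->
      Map.update(acc, script, {index, 1}, fn {first, count} ->
        {first, count + 1}
      end)
    end)
  end
end

StatisticSketch.from_scripts([:latin, :latin, :common, :han, :han])
# => %{common: {2, 1}, han: {3, 2}, latin: {0, 2}}
```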

# `unaccent`

Removes accents (diacritical marks) from
a string.

## Arguments

* `string` is any `t:String.t/0`

## Returns

* A string with all diacritical marks
  removed

## Notes

The string is first normalised to `:nfd` form
and then all characters in the block
`:combining_diacritical_marks` are removed
from the string.

## Example

    iex> Unicode.unaccent("Et Ça sera sa moitié.")
    "Et Ca sera sa moitie."
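
The two steps described in the notes can be approximated with the standard library alone. In this sketch `UnaccentSketch` is a hypothetical module, and the Combining Diacritical Marks block is hard-coded as the range U+0300..U+036F instead of being looked up through the library.

```elixir
defmodule UnaccentSketch do
  # The Combining Diacritical Marks block, U+0300..U+036F.
  @combining_diacritical_marks 0x0300..0x036F

  def unaccent(string) when is_binary(string) do
    string
    # Decompose so that "é" becomes "e" followed by U+0301.
    |> String.normalize(:nfd)
    |> String.to_charlist()
    # Drop every codepoint that falls in the combining marks block.
    |> Enum.reject(&(&1 in @combining_diacritical_marks))
    |> List.to_string()
  end
end

UnaccentSketch.unaccent("Et Ça sera sa moitié.")
# => "Et Ca sera sa moitie."
```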

# `upcase?`

# `uppercase?`

```elixir
@spec uppercase?(codepoint_or_string()) :: boolean()
```

Returns `true` if a single Unicode codepoint (or all characters
in the given string) adhere to the category `:Lu`, otherwise returns `false`.

Notice that there are many languages that do not have a distinction
between cases. Their characters are not included in this group.

## Arguments

* `codepoint_or_string` is a single integer codepoint
  or a `t:String.t/0`.

## Returns

* `true` or `false`

For the string-version, the result will be true only if _all_
codepoints in the string adhere to the property.

## Examples

    iex> Unicode.uppercase?(?a)
    false

    iex> Unicode.uppercase?("A")
    true

    iex> Unicode.uppercase?("Elixir")
    false

    iex> Unicode.uppercase?("CAMEMBERT")
    true

    iex> Unicode.uppercase?("foo, bar")
    false

    iex> Unicode.uppercase?("42")
    false

    iex> Unicode.uppercase?("Σ")
    true

    iex> Unicode.uppercase?("σ")
    false

# `version`

Returns the version of Unicode in use.

