# `Unicode.String`
[🔗](https://github.com/elixir-unicode/unicode_string/blob/v2.1.0/lib/unicode/string.ex#L1)

This module provides functions that implement some
of the [Unicode](https://unicode.org) standards:

* The [Unicode Case Mapping](https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf) algorithm
  to provide mapping to upper, lower and title case text.

* The [Unicode Case Folding](https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf) algorithm
  to provide case-independent equality checking irrespective of language or script.

* The [Unicode Segmentation](https://unicode.org/reports/tr29/) algorithm to detect,
  break or split strings into grapheme clusters, words and sentences.

* The [Unicode Line Breaking](https://www.unicode.org/reports/tr14/) algorithm to determine
  line break placement to support word-wrapping.

# `break_match`

```elixir
@type break_match() ::
  {break_or_no_break(), {String.t(), {String.t(), String.t()}}}
  | {break_or_no_break(), {String.t(), String.t()}}
```

# `break_or_no_break`

```elixir
@type break_or_no_break() :: :break | :no_break
```

# `break_type`

```elixir
@type break_type() :: :grapheme | :word | :line | :sentence
```

# `error_return`

```elixir
@type error_return() :: {:error, String.t()}
```

# `mode_or_language`

```elixir
@type mode_or_language() :: :turkic | nil | %{language: atom()}
```

# `option`

```elixir
@type option() ::
  {:locale, String.t() | map()}
  | {:break, break_type()}
  | {:suppressions, boolean()}
```

# `split_option`

```elixir
@type split_option() ::
  {:locale, String.t() | map()}
  | {:break, break_type()}
  | {:suppressions, boolean()}
  | {:trim, boolean()}
```

# `string_interval`

```elixir
@type string_interval() :: {String.t(), String.t()}
```

# `break`

```elixir
@spec break(string_interval :: string_interval(), options :: [option()]) ::
  break_match() | error_return()
```

Returns match data indicating if the
requested break is applicable
at the point between the two string
segments represented by `{string_before, string_after}`.

## Arguments

* `string_interval` is any 2-tuple consisting
  of the string before a possible break and the string
  after a possible break.

* `options` is a keyword list of
  options.

## Options

* `:locale` is any locale returned by
  `Unicode.String.Segment.known_segmentation_locales/0` or
  `Unicode.String.Dictionary.known_dictionary_locales/0`.
  The default is "root" which corresponds
  to the break rules defined by the
  [Unicode Segmentation](https://unicode.org/reports/tr29/) rules.

* `:break` is the type of break. It is one of
  `:grapheme`, `:word`, `:line` or `:sentence`. The
  default is `:word`.

* `:suppressions` is a boolean which,
  if `true`, will suppress breaks for common
  abbreviations defined for the `locale`. The
  default is `true`.

## Returns

A tuple indicating if a break would
be applicable at this point between
`string_before` and `string_after`.

* `{:break, {string_before, {matched_string, remaining_string}}}` or

* `{:no_break, {string_before, {matched_string, remaining_string}}}` or

* `{:error, reason}`.

## Examples

    iex> Unicode.String.break {"This is ", "some words"}
    {:break, {"This is ", {"s", "ome words"}}}

    iex> Unicode.String.break {"This is ", "some words"}, break: :sentence
    {:no_break, {"This is ", {"s", "ome words"}}}

    iex> Unicode.String.break {"This is one. ", "This is some words."}, break: :sentence
    {:break, {"This is one. ", {"T", "his is some words."}}}
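The three return shapes listed above lend themselves to a `case` expression. The following is a minimal sketch, assuming the `:unicode_string` package is available as a dependency; the `BreakDemo` module and its `describe/3` function are hypothetical names used only for illustration.

```elixir
# Assumes the :unicode_string package is available as a dependency.
defmodule BreakDemo do
  # Hypothetical helper: turns the result of Unicode.String.break/2
  # into a short human-readable description.
  def describe(string_before, string_after, options \\ []) do
    case Unicode.String.break({string_before, string_after}, options) do
      {:break, _match} -> "break allowed"
      {:no_break, _match} -> "no break"
      {:error, reason} -> "error: " <> reason
    end
  end
end

# Word break (the default), as in the first example above
BreakDemo.describe("This is ", "some words")

# Sentence break, as in the second example above
BreakDemo.describe("This is ", "some words", break: :sentence)
```

When only the yes/no answer is needed, `break?/2` avoids the pattern match at the cost of raising on errors.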

# `break?`

```elixir
@spec break?(string_interval :: string_interval(), options :: [option()]) ::
  boolean() | no_return()
```

Returns a boolean indicating if the
requested break is applicable
at the point between the two string
segments represented by `{string_before, string_after}`.

## Arguments

* `string_interval` is any 2-tuple consisting
  of the string before a possible break and the string
  after a possible break.

* `options` is a keyword list of
  options.

## Options

* `:locale` is any locale returned by
  `Unicode.String.Segment.known_segmentation_locales/0` or
  `Unicode.String.Dictionary.known_dictionary_locales/0`.
  The default is "root" which corresponds
  to the break rules defined by the
  [Unicode Segmentation](https://unicode.org/reports/tr29/) rules.

* `:break` is the type of break. It is one of
  `:grapheme`, `:word`, `:line` or `:sentence`. The
  default is `:word`.

* `:suppressions` is a boolean which,
  if `true`, will suppress breaks for common
  abbreviations defined for the `locale`. The
  default is `true`.

## Returns

* `true` or `false` or

* raises an exception if there is an error.

## Examples

    iex> Unicode.String.break? {"This is ", "some words"}
    true

    iex> Unicode.String.break? {"This is ", "some words"}, break: :sentence
    false

    iex> Unicode.String.break? {"This is one. ", "This is some words."}, break: :sentence
    true

# `downcase`
*since 1.3.0* 

```elixir
@spec downcase(String.t(), Keyword.t()) :: String.t()
```

Converts all characters in the given string to lower case
according to the Unicode Casing algorithm.

### Arguments

* `string` is any `t:String.t/0`.

* `options` is a keyword list of options.

### Options

* `:locale` is any [ISO 639](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)
  language code or a [Localize.LanguageTag](https://hexdocs.pm/localize/Localize.LanguageTag.html)
  which provides integration with [localize](https://hex.pm/packages/localize)
  applications.  The default is `:any` which signifies the
  application of the base Unicode casing algorithm.

### Notes

* The locale option determines the use of certain locale-specific
  casing rules.  Where no specific casing rules apply to
  the given locale, the base Unicode casing algorithm is
  applied. The locales which have customized casing rules
  are returned by `Unicode.String.special_casing_locales/0`.

### Returns

* `downcased_string`

### Examples

    iex> Unicode.String.downcase("THE QUICK BROWN FOX")
    "the quick brown fox"

    # Lower case Greek with a final sigma
    iex> Unicode.String.downcase("ὈΔΥΣΣΕΎΣ", locale: :el)
    "ὀδυσσεύς"

    # Lower case in Turkish and Azeri correctly handles
    # dotted-İ and undotted-I
    iex> Unicode.String.downcase("DİYARBAKIR", locale: :tr)
    "diyarbakır"

# `equals_ignoring_case?`

```elixir
@spec equals_ignoring_case?(String.t(), String.t(), mode_or_language()) :: boolean()
```

Compares two strings in a case insensitive
manner.

Case folding is applied to the two string
arguments which are then compared with the
`==` operator.

## Arguments

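* `string_a` and `string_b` are two strings
  to be compared.

* The third argument is a `t:mode_or_language/0` value:
  `:turkic`, `nil` or a map with a `:language` key. It may
  be omitted, as in the examples below.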

## Returns

* `true` or `false`

## Notes

* This function applies the [Unicode Case Folding
  algorithm](https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf)

* The algorithm does not apply any treatment to diacritical
  marks hence "compare strings without accents" is not
  part of this function.

* No string normalization is performed. Where the
  normalization state of the string cannot be guaranteed
  it is recommended they be normalized before comparison
  using `String.normalize(string, :nfc)`.

## Examples

    iex> Unicode.String.equals_ignoring_case? "ABC", "abc"
    true

    iex> Unicode.String.equals_ignoring_case? "beißen", "beissen"
    true

    iex> Unicode.String.equals_ignoring_case? "grüßen", "grussen"
    false
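The normalization note above can be illustrated with the standard library alone. This is a sketch, not part of the library: the `NormalizedCompare` module is a hypothetical wrapper that normalizes both arguments to NFC before delegating to `equals_ignoring_case?`, and it assumes the `:unicode_string` package is available.

```elixir
defmodule NormalizedCompare do
  # Hypothetical helper: NFC-normalizes both strings before the
  # case-insensitive comparison. Requires the :unicode_string package.
  def equal?(string_a, string_b) do
    Unicode.String.equals_ignoring_case?(
      String.normalize(string_a, :nfc),
      String.normalize(string_b, :nfc)
    )
  end
end

# The same visible text in two different normalization forms:
composed = "caf\u00E9"     # "é" as a single code point
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent

# Without normalization the binaries differ
composed == decomposed
# false

# After NFC normalization they are identical
String.normalize(decomposed, :nfc) == composed
# true
```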

# `fold`

# `fold`

# `is_language`
*macro* 

# `is_script`
*macro* 

# `is_territory`
*macro* 

# `next`

```elixir
@spec next(string :: String.t(), split_options :: [split_option()]) ::
  String.t() | nil | error_return()
```

Returns next segment in a string.

## Arguments

* `string` is any `t:String.t/0`.

* `options` is a keyword list of
  options.

## Returns

A tuple with the segment and the remainder of the string, or `""`
in case the string reached its end.

* `{next_string, rest_of_the_string}` or

* `{:error, reason}`

## Options

* `:locale` is any locale returned by
  `Unicode.String.Segment.known_segmentation_locales/0` or
  `Unicode.String.Dictionary.known_dictionary_locales/0` or
  a [Localize.LanguageTag](https://hexdocs.pm/localize/Localize.LanguageTag.html)
  struct. The default is "root" which corresponds
  to the break rules defined by the
  [Unicode Segmentation](https://unicode.org/reports/tr29/) rules.

* `:break` is the type of break. It is one of
  `:grapheme`, `:word`, `:line` or `:sentence`. The
  default is `:word`.

* `:suppressions` is a boolean which,
  if `true`, will suppress breaks for common
  abbreviations defined for the `locale`. The
  default is `true`.

## Examples

    iex> Unicode.String.next "This is a sentence. And another.", break: :word
    {"This", " is a sentence. And another."}

    iex> Unicode.String.next "This is a sentence. And another.", break: :sentence
    {"This is a sentence. ", "And another."}
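Repeated calls to `next/2` can walk a string one segment at a time. The sketch below assumes the `:unicode_string` package is available; `SegmentWalk` is a hypothetical module. Because the spec and the prose above describe the end-of-string result differently, the loop stops on an empty remainder and also on any result that is not a `{segment, rest}` tuple of non-empty binaries.

```elixir
# Assumes the :unicode_string package is available as a dependency.
defmodule SegmentWalk do
  # Hypothetical helper, not part of Unicode.String: collects
  # segments by calling Unicode.String.next/2 repeatedly.
  def segments(string, options \\ []) when is_binary(string) do
    walk(string, options, [])
  end

  # Nothing left to consume
  defp walk("", _options, acc), do: Enum.reverse(acc)

  defp walk(string, options, acc) do
    case Unicode.String.next(string, options) do
      {segment, rest} when is_binary(segment) and is_binary(rest) and segment != "" ->
        walk(rest, options, [segment | acc])

      # nil, "", {:error, reason} or anything unexpected ends the walk
      _other ->
        Enum.reverse(acc)
    end
  end
end

SegmentWalk.segments("This is a sentence. And another.", break: :sentence)
```

For most uses `split/2` or `stream/2` are the more direct choices; a manual loop is mainly useful when processing should stop partway through the string.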

# `special_casing_locales`

Returns a list of locales that have special
casing rules.

### Example

    iex> Unicode.String.special_casing_locales()
    [:az, :el, :lt, :nl, :tr]
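As a small illustration, the returned list can be used to check whether locale-specific casing rules exist for a locale. This sketch assumes the `:unicode_string` package is available; `CasingInfo` and `rules_for/1` are hypothetical names.

```elixir
# Assumes the :unicode_string package is available as a dependency.
defmodule CasingInfo do
  # Hypothetical helper: reports which kind of casing rules
  # would apply to the given locale atom.
  def rules_for(locale) when is_atom(locale) do
    if locale in Unicode.String.special_casing_locales() do
      :locale_specific
    else
      :base
    end
  end
end

CasingInfo.rules_for(:tr)
# :locale_specific, since :tr is in the list shown above

CasingInfo.rules_for(:fr)
# :base, since :fr is not in the list shown above
```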

# `split`

```elixir
@spec split(string :: String.t(), split_options :: [split_option()]) ::
  [String.t(), ...] | error_return()
```

Splits a string according to the
specified break type.

## Arguments

* `string` is any `t:String.t/0`.

* `options` is a keyword list of
  options.

## Returns

* A list of strings after applying the
  specified break rules or

* `{:error, reason}`

## Options

* `:locale` is any locale returned by
  `Unicode.String.Segment.known_segmentation_locales/0`  or
  `Unicode.String.Dictionary.known_dictionary_locales/0` or
  a [Localize.LanguageTag](https://hexdocs.pm/localize/Localize.LanguageTag.html)
  struct. The default is "root" which corresponds
  to the break rules defined by the
  [Unicode Segmentation](https://unicode.org/reports/tr29/) rules.

* `:break` is the type of break. It is one of
  `:grapheme`, `:word`, `:line` or `:sentence`. The
  default is `:word`.

* `:suppressions` is a boolean which,
  if `true`, will suppress breaks for common
  abbreviations defined for the `locale`. The
  default is `true`.

* `:trim` is a boolean indicating if segments
  that are comprised of only white space are to be
  excluded from the returned list. The default
  is `false`.

## Examples

    iex> Unicode.String.split "This is a sentence. And another.", break: :word
    ["This", " ", "is", " ", "a", " ", "sentence", ".", " ", "And", " ", "another", "."]

    iex> Unicode.String.split "This is a sentence. And another.", break: :word, trim: true
    ["This", "is", "a", "sentence", ".", "And", "another", "."]

    iex> Unicode.String.split "This is a sentence. And another.", break: :sentence
    ["This is a sentence. ", "And another."]
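Since word segmentation returns punctuation as separate segments even with `trim: true`, counting words needs one extra filtering step. The following sketch assumes the `:unicode_string` package is available; `WordCount` is a hypothetical module, and treating "only punctuation" as "not a word" is an assumption made for this example.

```elixir
# Assumes the :unicode_string package is available as a dependency.
defmodule WordCount do
  # Hypothetical helper: counts word segments, ignoring white space
  # (via trim: true) and segments made up only of punctuation.
  def count(text, options \\ []) do
    options = Keyword.merge(options, break: :word, trim: true)

    case Unicode.String.split(text, options) do
      {:error, reason} ->
        {:error, reason}

      segments ->
        Enum.count(segments, fn segment ->
          not String.match?(segment, ~r/^\p{P}+$/u)
        end)
    end
  end
end

WordCount.count("This is a sentence. And another.")
# 6, given the trimmed split output shown in the examples above
```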

# `splitter`

```elixir
@spec splitter(string :: String.t(), split_options :: [split_option()]) ::
  function() | error_return()
```

Returns an enumerable that splits a string on demand.

## Arguments

* `string` is any `t:String.t/0`.

* `options` is a keyword list of
  options.

## Returns

* A function that implements the enumerable
  protocol or

* `{:error, reason}`

## Options

* `:locale` is any locale returned by
  `Unicode.String.Segment.known_segmentation_locales/0` or
  `Unicode.String.Dictionary.known_dictionary_locales/0`.
  The default is "root" which corresponds
  to the break rules defined by the
  [Unicode Segmentation](https://unicode.org/reports/tr29/) rules.

* `:break` is the type of break. It is one of
  `:grapheme`, `:word`, `:line` or `:sentence`. The
  default is `:word`.

* `:suppressions` is a boolean which,
  if `true`, will suppress breaks for common
  abbreviations defined for the `locale`. The
  default is `true`.

* `:trim` is a boolean indicating if segments
  that are comprised of only white space are to be
  excluded from the returned list. The default
  is `false`.

## Examples

    iex> enum = Unicode.String.splitter "This is a sentence. And another.", break: :word, trim: true
    iex> Enum.take enum, 3
    ["This", "is", "a"]

# `stream`
*since 1.2.0* 

```elixir
@spec stream(string :: String.t(), split_options :: [split_option()]) ::
  Enumerable.t() | {:error, String.t()}
```

Returns a stream that breaks a string into
graphemes, words, sentences or lines.

## Arguments

* `string` is any `t:String.t/0`.

* `options` is a keyword list of
  options.

## Returns

* A stream that is an `t:Enumerable.t/0` that
  can be used with the functions in the `Stream`
  or `Enum` modules.

* `{:error, reason}`

## Options

* `:locale` is any locale returned by
  `Unicode.String.Segment.known_segmentation_locales/0` or
  `Unicode.String.Dictionary.known_dictionary_locales/0` or
  a [Localize.LanguageTag](https://hexdocs.pm/localize/Localize.LanguageTag.html)
  struct. The default is "root" which corresponds
  to the break rules defined by the
  [Unicode Segmentation](https://unicode.org/reports/tr29/) rules.

* `:break` is the type of break. It is one of
  `:grapheme`, `:word`, `:line` or `:sentence`. The
  default is `:word`.

* `:suppressions` is a boolean which,
  if `true`, will suppress breaks for common
  abbreviations defined for the `locale`. The
  default is `true`.

* `:trim` is a boolean indicating if segments
  that are comprised of only white space are to be
  excluded from the returned list. The default
  is `false`.

## Examples

    iex> Enum.to_list Unicode.String.stream("this is a set of words", trim: true)
    ["this", "is", "a", "set", "of", "words"]

    iex> Enum.to_list Unicode.String.stream("this is a set of words", break: :sentence, trim: true)
    ["this is a set of words"]
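Because the result is an enumerable, it can be composed with `Stream` functions so that only the segments actually consumed are processed. A minimal sketch, assuming the `:unicode_string` package is available; `StreamDemo` and `first_upcased/2` are hypothetical names.

```elixir
# Assumes the :unicode_string package is available as a dependency.
defmodule StreamDemo do
  # Hypothetical helper: upcases the first `count` word segments
  # of `string`, passing an error tuple through unchanged.
  def first_upcased(string, count) do
    case Unicode.String.stream(string, trim: true) do
      {:error, reason} ->
        {:error, reason}

      stream ->
        stream
        |> Stream.map(&String.upcase/1)
        |> Enum.take(count)
    end
  end
end

StreamDemo.first_upcased("this is a set of words", 2)
# ["THIS", "IS"], given the segments shown in the first example above
```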

# `titlecase`
*since 1.3.0* 

```elixir
@spec titlecase(String.t(), Keyword.t()) :: String.t()
```

Converts the given string to title case
according to the Unicode Casing algorithm.

Title casing is the process of transforming
the first character of each word in a string
to upper case and the following characters
in the word to lower case.

As a result this algorithm does not conform
to the norms of all languages and cultures.
However special processing is performed for
the Dutch diphthong "IJ" when using the `:nl`
casing locale.

Further work will focus on improving title
casing of Greek diphthongs.
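The per-word transformation described above can be approximated with the standard library alone. This is a deliberately naive sketch and not how `titlecase/2` is implemented: it splits on single spaces instead of using Unicode word segmentation, and it has no locale-specific rules. `NaiveTitlecase` is a hypothetical module.

```elixir
defmodule NaiveTitlecase do
  # Naive illustration only: upper-cases the first character of each
  # space-separated word and lower-cases the remaining characters.
  def titlecase(string) when is_binary(string) do
    string
    |> String.split(" ")
    |> Enum.map(&String.capitalize/1)
    |> Enum.join(" ")
  end
end

NaiveTitlecase.titlecase("THE QUICK BROWN FOX")
# "The Quick Brown Fox"

# No special handling of the Dutch "IJ": only the "i" is upper-cased
NaiveTitlecase.titlecase("ijsselmeer")
# "Ijsselmeer"
```

The second call shows the kind of language-specific case that the `:nl` casing locale is designed to handle.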

### Arguments

* `string` is any `t:String.t/0`.

* `options` is a keyword list of options.

### Options

* `:locale` is any [ISO 639](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)
  language code or a [Localize.LanguageTag](https://hexdocs.pm/localize/Localize.LanguageTag.html)
  which provides integration with [localize](https://hex.pm/packages/localize)
  applications.  The default is `:any` which signifies the
  application of the base Unicode casing algorithm.

### Notes

* The locale option determines the use of certain locale-specific
  casing rules.  Where no specific casing rules apply to
  the given locale, the base Unicode casing algorithm is
  applied. The locales which have customized casing rules
  are returned by `Unicode.String.special_casing_locales/0`.

* The string is broken into words using
  `Unicode.String.break/2` which implements the
  [Unicode segmentation algorithm](https://unicode.org/reports/tr29/).

### Returns

* `title_cased_string`.

### Examples

    iex> Unicode.String.titlecase("THE QUICK BROWN FOX")
    "The Quick Brown Fox"

    # Title case Dutch with leading diphthong
    iex> Unicode.String.titlecase("ijsselmeer", locale: :nl)
    "IJsselmeer"

# `upcase`
*since 1.3.0* 

```elixir
@spec upcase(String.t(), Keyword.t()) :: String.t()
```

Converts all characters in the given string to upper case
according to the Unicode Casing algorithm.

### Arguments

* `string` is any `t:String.t/0`.

* `options` is a keyword list of options.

### Options

* `:locale` is any [ISO 639](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)
  language code or a [Localize.LanguageTag](https://hexdocs.pm/localize/Localize.LanguageTag.html)
  which provides integration with [localize](https://hex.pm/packages/localize)
  applications.  The default is `:any` which signifies the
  application of the base Unicode casing algorithm.

### Notes

* The locale option determines the use of certain locale-specific
  casing rules.  Where no specific casing rules apply to
  the given locale, the base Unicode casing algorithm is
  applied. The locales which have customized casing rules
  are returned by `Unicode.String.special_casing_locales/0`.

### Returns

* `upcased_string`

### Examples

    # Basic case transformation
    iex> Unicode.String.upcase("the quick brown fox")
    "THE QUICK BROWN FOX"

    # Dotted-I in Turkish and Azeri
    iex> Unicode.String.upcase("Diyarbakır", locale: :tr)
    "DİYARBAKIR"

    # Upper case in Greek removes diacritics
    iex> Unicode.String.upcase("Πατάτα, Αέρας, Μυστήριο", locale: :el)
    "ΠΑΤΑΤΑ, ΑΕΡΑΣ, ΜΥΣΤΗΡΙΟ"

---

*Consult [api-reference.md](api-reference.md) for complete listing*