Unicode.String (Unicode String v1.7.0)

This module provides functions that implement some of the Unicode standards:

The Unicode Case Mapping algorithm to provide mapping to upper, lower and title case text.
The Unicode Case Folding algorithm to provide case-independent equality checking irrespective of language or script.
The Unicode Segmentation algorithm to detect, break or split strings into grapheme clusters, words and sentences.
The Unicode Line Breaking algorithm to determine line break placement to support word-wrapping.

Summary

Types

break_match()

break_or_no_break()

break_type()

error_return()

mode_or_language()

option()

split_option()

string_interval()

Functions

break(arg, options \\ [])

Returns match data indicating if the requested break is applicable at the point between the two string segments represented by {string_before, string_after}.

break?(arg, options \\ [])

Returns a boolean indicating if the requested break is applicable at the point between the two string segments represented by {string_before, string_after}.

downcase(string, options \\ [])

Converts all characters in the given string to lower case according to the Unicode Casing algorithm.

equals_ignoring_case?(string_a, string_b, mode_or_language_tag \\ nil)

Compares two strings in a case insensitive manner.

find_matching_locale(candidates, known_locales, default)

fold(string)

See Unicode.String.Case.Folding.fold/1.

fold(string, type)

See Unicode.String.Case.Folding.fold/2.

is_language(language)

is_script(script)

is_territory(territory)

next(string, options \\ [])

Returns next segment in a string.

special_casing_locales()

Returns a list of locales that have special casing rules.

split(string, options \\ [])

Splits a string according to the specified break type.

splitter(string, options)

Returns an enumerable that splits a string on demand.

stream(string, options \\ [])

Returns a stream that breaks a string into graphemes, words, sentences or line breaks.

titlecase(string, options \\ [])

Converts the given string to title case according to the Unicode Casing algorithm.

upcase(string, options \\ [])

Converts all characters in the given string to upper case according to the Unicode Casing algorithm.

Types

break_match()

@type break_match() ::
  {break_or_no_break(), {String.t(), {String.t(), String.t()}}}
  | {break_or_no_break(), {String.t(), String.t()}}

break_or_no_break()

@type break_or_no_break() :: :break | :no_break

break_type()

@type break_type() :: :grapheme | :word | :line | :sentence

error_return()

@type error_return() :: {:error, String.t()}

mode_or_language()

@type mode_or_language() :: :turkic | nil | %{language: atom()}

option()

@type option() ::
  {:locale, String.t() | map()}
  | {:break, break_type()}
  | {:suppressions, boolean()}

split_option()

@type split_option() ::
  {:locale, String.t() | map()}
  | {:break, break_type()}
  | {:suppressions, boolean()}
  | {:trim, boolean()}

string_interval()

@type string_interval() :: {String.t(), String.t()}

Functions

break(arg, options \\ [])

@spec break(string_interval :: string_interval(), options :: [option()]) ::
  break_match() | error_return()

Returns match data indicating if the requested break is applicable at the point between the two string segments represented by {string_before, string_after}.

Arguments

string_interval is any 2-tuple consisting of the string before a possible break and the string after a possible break.
options is a keyword list of options.

Options

:locale is any locale returned by Unicode.String.Segment.known_segmentation_locales/0 or Unicode.String.Dictionary.known_dictionary_locales/0. The default is "root" which corresponds to the break rules defined by the Unicode Segmentation rules.
:break is the type of break. It is one of :grapheme, :word, :line or :sentence. The default is :word.
:suppressions is a boolean which, if true, will suppress breaks for common abbreviations defined for the locale. The default is true.

Returns

A tuple indicating if a break would be applicable at this point between string_before and string_after.

{:break, {string_before, {matched_string, remaining_string}}} or
{:no_break, {string_before, {matched_string, remaining_string}}} or
{:error, reason}.

Examples

iex> Unicode.String.break {"This is ", "some words"}
{:break, {"This is ", {"s", "ome words"}}}

iex> Unicode.String.break {"This is ", "some words"}, break: :sentence
{:no_break, {"This is ", {"s", "ome words"}}}

iex> Unicode.String.break {"This is one. ", "This is some words."}, break: :sentence
{:break, {"This is one. ", {"T", "his is some words."}}}
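
The return shapes listed above lend themselves to pattern matching. The following is a minimal sketch, assuming a hypothetical helper module BreakSketch that is not part of the library. It reduces a break/2 result to {:ok, boolean} and passes errors through. The sample input is the documented return value of the first example rather than a live call, so the sketch runs without the library installed.

```elixir
defmodule BreakSketch do
  # Hypothetical helper, not part of Unicode.String.
  # Collapses the documented return shapes of break/2 into {:ok, boolean}
  # and returns an error tuple unchanged.

  # Shape with match data: {string_before, {matched_string, remaining_string}}
  def break_here?({:break, {_before, {_matched, _rest}}}), do: {:ok, true}
  def break_here?({:no_break, {_before, {_matched, _rest}}}), do: {:ok, false}

  # Second shape allowed by break_match(): {string_before, string_after}
  def break_here?({:break, {_before, _after}}), do: {:ok, true}
  def break_here?({:no_break, {_before, _after}}), do: {:ok, false}

  def break_here?({:error, _reason} = error), do: error
end

# Documented result of: Unicode.String.break({"This is ", "some words"})
result = {:break, {"This is ", {"s", "ome words"}}}
BreakSketch.break_here?(result)
```

When only the boolean is needed and raising on error is acceptable, break?/2 covers the same ground directly.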

break?(arg, options \\ [])

@spec break?(string_interval :: string_interval(), options :: [option()]) ::
  boolean() | no_return()

Returns a boolean indicating if the requested break is applicable at the point between the two string segments represented by {string_before, string_after}.

Arguments

string_interval is any 2-tuple consisting of the string before a possible break and the string after a possible break.
options is a keyword list of options.

Options

:locale is any locale returned by Unicode.String.Segment.known_segmentation_locales/0 or Unicode.String.Dictionary.known_dictionary_locales/0. The default is "root" which corresponds to the break rules defined by the Unicode Segmentation rules.
:break is the type of break. It is one of :grapheme, :word, :line or :sentence. The default is :word.
:suppressions is a boolean which, if true, will suppress breaks for common abbreviations defined for the locale. The default is true.

Returns

true or false or
raises an exception if there is an error.

Examples

iex> Unicode.String.break? {"This is ", "some words"}
true

iex> Unicode.String.break? {"This is ", "some words"}, break: :sentence
false

iex> Unicode.String.break? {"This is one. ", "This is some words."}, break: :sentence
true

downcase(string, options \\ [])

(since 1.3.0)

@spec downcase(String.t(), Keyword.t()) :: String.t()

Converts all characters in the given string to lower case according to the Unicode Casing algorithm.

Arguments

string is any String.t/0.
options is a keyword list of options.

Options

:locale is any ISO 639 language code or a LanguageTag which provides integration with ex_cldr applications. The default is :any which signifies the application of the base Unicode casing algorithm.

Notes

The locale option determines the use of certain locale-specific casing rules. Where no specific casing rules apply to the given locale, the base Unicode casing algorithm is applied. The locales which have customized casing rules are returned by Unicode.String.special_casing_locales/0.

Returns

downcased_string

Examples

iex> Unicode.String.downcase("THE QUICK BROWN FOX")
"the quick brown fox"

# Lower case Greek with a final sigma
iex> Unicode.String.downcase("ὈΔΥΣΣΕΎΣ", locale: :el)
"ὀδυσσεύς"

# Lower case in Turkish and Azeri correctly handles
# dotted İ and dotless I
iex> Unicode.String.downcase("DİYARBAKIR", locale: :tr)
"diyarbakır"

equals_ignoring_case?(string_a, string_b, mode_or_language_tag \\ nil)

@spec equals_ignoring_case?(String.t(), String.t(), mode_or_language()) :: boolean()

Compares two strings in a case insensitive manner.

Case folding is applied to the two string arguments which are then compared with the == operator.

Arguments

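string_a and string_b are the two strings to be compared.
mode_or_language_tag is :turkic, nil or a map with a :language key, as described by mode_or_language/0. The default is nil.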

Returns

true or false

Notes

This function applies the Unicode Case Folding algorithm.
The algorithm does not apply any treatment to diacritical marks, hence comparing strings without regard to accents is not part of this function.
No string normalization is performed. Where the normalization state of the strings cannot be guaranteed, it is recommended that they be normalized before comparison using String.normalize(string, :nfc).

Examples

iex> Unicode.String.equals_ignoring_case? "ABC", "abc"
true

iex> Unicode.String.equals_ignoring_case? "beißen", "beissen"
true

iex> Unicode.String.equals_ignoring_case? "grüßen", "grussen"
false
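
The note on normalization can be illustrated with the standard library alone. This is a minimal sketch: the two sample strings are assumptions chosen for illustration, and only String.normalize/2 from Elixir itself is called. Both strings render identically but use different code point sequences, so they compare as unequal until both are normalized to NFC.

```elixir
# "é" as a single code point (U+00E9)
composed = "caf\u00E9"
# "e" followed by a combining acute accent (U+0301)
decomposed = "cafe\u0301"

# Same rendered text, different code points: this comparison is false
raw_equal = composed == decomposed

# After NFC normalization both strings hold the same code points, so they
# can be passed on to a case-insensitive comparison.
nfc_a = String.normalize(composed, :nfc)
nfc_b = String.normalize(decomposed, :nfc)
nfc_equal = nfc_a == nfc_b
```

The normalized values are the ones to pass to equals_ignoring_case?/3 when the input comes from sources with unknown normalization.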

find_matching_locale(candidates, known_locales, default)

fold(string)

See Unicode.String.Case.Folding.fold/1.

fold(string, type)

See Unicode.String.Case.Folding.fold/2.

is_language(language)

(macro)

is_script(script)

(macro)

is_territory(territory)

(macro)

next(string, options \\ [])

@spec next(string :: String.t(), split_options :: [split_option()]) ::
  String.t() | nil | error_return()

Returns next segment in a string.

Arguments

string is any String.t/0.
options is a keyword list of options.

Returns

A tuple with the segment and the remainder of the string or "" in case the string reached its end.

{next_string, rest_of_the_string} or
{:error, reason}

Options

:locale is any locale returned by Unicode.String.Segment.known_segmentation_locales/0 or Unicode.String.Dictionary.known_dictionary_locales/0 or a Cldr.LanguageTag struct. The default is "root" which corresponds to the break rules defined by the Unicode Segmentation rules.
:break is the type of break. It is one of :grapheme, :word, :line or :sentence. The default is :word.
:suppressions is a boolean which, if true, will suppress breaks for common abbreviations defined for the locale. The default is true.

Examples

iex> Unicode.String.next "This is a sentence. And another.", break: :word
{"This", " is a sentence. And another."}

iex> Unicode.String.next "This is a sentence. And another.", break: :sentence
{"This is a sentence. ", "And another."}

special_casing_locales()

Returns a list of locales that have special casing rules.

Example

iex> Unicode.String.special_casing_locales()
[:az, :el, :lt, :nl, :tr]

split(string, options \\ [])

@spec split(string :: String.t(), split_options :: [split_option()]) ::
  [String.t(), ...] | error_return()

Splits a string according to the specified break type.

Arguments

string is any String.t/0.
options is a keyword list of options.

Returns

A list of strings after applying the specified break rules or
{:error, reason}

Options

:locale is any locale returned by Unicode.String.Segment.known_segmentation_locales/0 or Unicode.String.Dictionary.known_dictionary_locales/0 or a Cldr.LanguageTag struct. The default is "root" which corresponds to the break rules defined by the Unicode Segmentation rules.
:break is the type of break. It is one of :grapheme, :word, :line or :sentence. The default is :word.
:suppressions is a boolean which, if true, will suppress breaks for common abbreviations defined for the locale. The default is true.
:trim is a boolean indicating if segments that consist only of white space are to be excluded from the returned list. The default is false.

Examples

iex> Unicode.String.split "This is a sentence. And another.", break: :word
["This", " ", "is", " ", "a", " ", "sentence", ".", " ", "And", " ", "another", "."]

iex> Unicode.String.split "This is a sentence. And another.", break: :word, trim: true
["This", "is", "a", "sentence", ".", "And", "another", "."]

iex> Unicode.String.split "This is a sentence. And another.", break: :sentence
["This is a sentence. ", "And another."]
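
Because word segmentation returns punctuation and white space as their own segments, counting words usually needs a filtering step. The following is a minimal sketch, assuming a hypothetical helper module WordCountSketch that is not part of the library. It works on the documented output of the first example above, supplied as a literal so it runs without the library installed.

```elixir
defmodule WordCountSketch do
  # Hypothetical helper, not part of Unicode.String.
  # Keeps only the segments that contain at least one letter or digit,
  # dropping pure white space and punctuation segments.
  def words(segments) do
    Enum.filter(segments, &String.match?(&1, ~r/[\p{L}\p{N}]/u))
  end
end

# Documented result of:
#   Unicode.String.split("This is a sentence. And another.", break: :word)
segments = ["This", " ", "is", " ", "a", " ", "sentence", ".", " ", "And", " ", "another", "."]

words = WordCountSketch.words(segments)
word_count = length(words)
```

The :trim option removes only white space segments, so punctuation such as "." still needs a filter like this one if it should not be counted.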

splitter(string, options)

@spec splitter(string :: String.t(), split_options :: [split_option()]) ::
  function() | error_return()

Returns an enumerable that splits a string on demand.

Arguments

string is any String.t/0.
options is a keyword list of options.

Returns

A function that implements the enumerable protocol or
{:error, reason}

Options

:locale is any locale returned by Unicode.String.Segment.known_segmentation_locales/0 or Unicode.String.Dictionary.known_dictionary_locales/0. The default is "root" which corresponds to the break rules defined by the Unicode Segmentation rules.
:break is the type of break. It is one of :grapheme, :word, :line or :sentence. The default is :word.
:suppressions is a boolean which, if true, will suppress breaks for common abbreviations defined for the locale. The default is true.
:trim is a boolean indicating if segments that consist only of white space are to be excluded from the returned enumerable. The default is false.

Examples

iex> enum = Unicode.String.splitter "This is a sentence. And another.", break: :word, trim: true
iex> Enum.take enum, 3
["This", "is", "a"]

stream(string, options \\ [])

(since 1.2.0)

@spec stream(string :: String.t(), split_options :: [split_option()]) ::
  Enumerable.t() | {:error, String.t()}

Returns a stream that breaks a string into graphemes, words, sentences or line breaks.

Arguments

string is any String.t/0.
options is a keyword list of options.

Returns

A stream that is an Enumerable.t/0 that can be used with the functions in the Stream or Enum modules or
{:error, reason}

Options

:locale is any locale returned by Unicode.String.Segment.known_segmentation_locales/0 or Unicode.String.Dictionary.known_dictionary_locales/0 or a Cldr.LanguageTag struct. The default is "root" which corresponds to the break rules defined by the Unicode Segmentation rules.
:break is the type of break. It is one of :grapheme, :word, :line or :sentence. The default is :word.
:suppressions is a boolean which, if true, will suppress breaks for common abbreviations defined for the locale. The default is true.
:trim is a boolean indicating if segments that consist only of white space are to be excluded from the stream. The default is false.

Examples

iex> Enum.to_list Unicode.String.stream("this is a set of words", trim: true)
["this", "is", "a", "set", "of", "words"]

iex> Enum.to_list Unicode.String.stream("this is a set of words", break: :sentence, trim: true)
["this is a set of words"]
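
Since the result is an Enumerable.t/0, it composes with lazy Stream functions so that only as many segments are produced as the consumer asks for. The following is a minimal sketch, assuming a hypothetical helper module StreamSketch that is not part of the library. A plain list with the documented segments stands in for the stream so the sketch runs without the library installed.

```elixir
defmodule StreamSketch do
  # Hypothetical helper, not part of Unicode.String.
  # Lazily filters segments by length (in graphemes) and stops as soon as
  # `count` matching segments have been found.
  def first_long_segments(segments, min_length, count) do
    segments
    |> Stream.filter(&(String.length(&1) >= min_length))
    |> Enum.take(count)
  end
end

# Stand-in for: Unicode.String.stream("this is a set of words", trim: true)
segments = ["this", "is", "a", "set", "of", "words"]

long_segments = StreamSketch.first_long_segments(segments, 3, 2)
```

With a real stream in place of the list, segmentation would stop once the second matching segment has been taken.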

titlecase(string, options \\ [])

(since 1.3.0)

@spec titlecase(String.t(), Keyword.t()) :: String.t()

Converts the given string to title case according to the Unicode Casing algorithm.

Title casing is the process of transforming the first character of each word in a string to upper case and the following characters in the word to lower case.

As a result this algorithm does not conform to the norms of all languages and cultures. However special processing is performed for the Dutch diphthong "IJ" when using the :nl casing locale.

Further work will focus on improving title casing of Greek diphthongs.

Arguments

string is any String.t/0.
options is a keyword list of options.

Options

:locale is any ISO 639 language code or a LanguageTag which provides integration with ex_cldr applications. The default is :any which signifies the application of the base Unicode casing algorithm.

Notes

The locale option determines the use of certain locale-specific casing rules. Where no specific casing rules apply to the given locale, the base Unicode casing algorithm is applied. The locales which have customized casing rules are returned by Unicode.String.special_casing_locales/0.
The string is broken into words using Unicode.String.break/2 which implements the Unicode segmentation algorithm.

Returns

title_cased_string.

Examples

iex> Unicode.String.titlecase("THE QUICK BROWN FOX")
"The Quick Brown Fox"

# Title case Dutch with leading diphthong
iex> Unicode.String.titlecase("ijsselmeer", locale: :nl)
"IJsselmeer"

upcase(string, options \\ [])

(since 1.3.0)

@spec upcase(String.t(), Keyword.t()) :: String.t()

Converts all characters in the given string to upper case according to the Unicode Casing algorithm.

Arguments

string is any String.t/0.
options is a keyword list of options.

Options

:locale is any ISO 639 language code or a LanguageTag which provides integration with ex_cldr applications. The default is :any which signifies the application of the base Unicode casing algorithm.

Notes

The locale option determines the use of certain locale-specific casing rules. Where no specific casing rules apply to the given locale, the base Unicode casing algorithm is applied. The locales which have customized casing rules are returned by Unicode.String.special_casing_locales/0.

Returns

upcased_string

Examples

# Basic case transformation
iex> Unicode.String.upcase("the quick brown fox")
"THE QUICK BROWN FOX"

# Dotted-I in Turkish and Azeri
iex> Unicode.String.upcase("Diyarbakır", locale: :tr)
"DİYARBAKIR"

# Upper case in Greek removes diacritics
iex> Unicode.String.upcase("Πατάτα, Αέρας, Μυστήριο", locale: :el)
"ΠΑΤΑΤΑ, ΑΕΡΑΣ, ΜΥΣΤΗΡΙΟ"