unicode_data v0.2.1 UnicodeData

Provides access to Unicode properties needed for more complex text processing.

Script detection

Proper text layout requires knowing which script is in use for a run of text. Unicode provides the Script property to identify the script associated with a codepoint. The script short name is also provided, which can be passed to font engines or cross-referenced with ISO 15924.

Once the script is identified, it’s possible to determine if the script is a right-to-left script, as well as what additional support might be required for proper layout.
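
As a rough sketch of that flow, the helper below gathers the script-level facts a layout engine might want for a run of text. `RunPlan`, `for_script/2`, and the map keys are hypothetical names chosen for illustration and are not part of UnicodeData. The `data` argument defaults to the UnicodeData module and is a parameter only so the sketch can be tried against a stub.

```elixir
defmodule RunPlan do
  # Hypothetical helper, not part of UnicodeData.
  # `data` is any module exposing the three script-level functions
  # documented on this page.
  def for_script(script, data \\ UnicodeData) do
    %{
      # Short name to hand to the font engine
      tag: data.script_to_tag(script),
      # Base direction of the script
      direction: if(data.right_to_left?(script), do: :rtl, else: :ltr),
      # Whether a cursive shaping pass driven by Joining_Type is needed
      cursive: data.uses_joining_type?(script)
    }
  end
end
```

The script name passed in would normally come from script_from_codepoint/1.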

Shaping support

The Joining_Type and Joining_Group properties provide support for shaping engines doing layout of cursive scripts.

Layout support

Bidirectional algorithms such as the one in UAX #9 require access to several Unicode properties in order to properly lay out paragraphs where the direction of the text is not uniform, for example when embedding an English word into a Hebrew paragraph.

The Bidi_Class, Bidi_Mirroring_Glyph, Bidi_Mirrored, Bidi_Paired_Bracket, and Bidi_Paired_Bracket_Type properties are specifically provided to allow for implementation of the Unicode bidirectional algorithm described in UAX #9.

Summary

Functions

Determine the bidirectional character type of a character

The Bidi_Mirroring_Glyph property returns the character suitable for character-based mirroring, if one exists. Otherwise, it returns nil

The Bidi_Mirrored property indicates whether or not there is another Unicode character that typically has a glyph that is the mirror image of the original character’s glyph

The Bidi_Paired_Bracket property is used to establish pairs of opening and closing brackets for the purposes of the Unicode bidirectional algorithm

The Unicode Bidi_Paired_Bracket_Type property classifies characters into opening and closing paired brackets for the purposes of the Unicode bidirectional algorithm

Determine the joining group for cursive scripts

Determine the joining type for cursive scripts

Determine if the script is written right-to-left

Look up the script property associated with a codepoint

Get the short name associated with a script. This is the tag used to identify scripts in OpenType fonts and generally matches the script code defined in ISO 15924

Determine if a script uses the Joining Type property to select contextual forms

Functions

bidi_class(codepoint)
bidi_class(integer | String.codepoint) :: String.t

Determine the bidirectional character type of a character.

This is used to initialize the Unicode bidirectional algorithm, published in UAX #9.

There are several blocks of unassigned code points which are reserved to specific script blocks and therefore return a specific bidirectional character type. For example, an unassigned code point in the Arabic block has type “AL”.

If not specifically assigned or reserved, the default value is “L” (Left-to-Right).

This is sourced from DerivedBidiClass.txt

Examples

iex> UnicodeData.bidi_class("A")
"L"
iex> UnicodeData.bidi_class("د")
"AL"
iex> UnicodeData.bidi_class("𐭀")
"R"
iex> UnicodeData.bidi_class("﹵")
"AL"
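
One common use of this property is picking a paragraph's base direction from its first strong character, in the spirit of rules P2 and P3 of UAX #9. The sketch below is a simplification under stated assumptions: `BaseDirection` is a hypothetical module, it does not skip text enclosed in directional isolates as P2 requires, and the `bidi_class` argument is injectable only so the code can be exercised with a stub.

```elixir
defmodule BaseDirection do
  # Hypothetical helper, not part of UnicodeData.
  # Returns :rtl when the first strong character is R or AL, otherwise :ltr.
  # Simplification: isolate-enclosed text is not skipped.
  def detect(text, bidi_class \\ &UnicodeData.bidi_class/1) do
    first_strong =
      text
      |> String.codepoints()
      |> Enum.map(bidi_class)
      |> Enum.find(&(&1 in ["L", "R", "AL"]))

    # No strong character at all (nil) falls back to left-to-right.
    if first_strong in ["R", "AL"], do: :rtl, else: :ltr
  end
end
```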

bidi_mirror_codepoint(codepoint)
bidi_mirror_codepoint(integer | String.codepoint) ::
  String.codepoint |
  nil

The Bidi_Mirroring_Glyph property returns the character suitable for character-based mirroring, if one exists. Otherwise, it returns nil.

Character-based mirroring is used by the Unicode bidirectional algorithm. A layout engine may want to consider other methods of mirroring.

This is sourced from BidiMirroring.txt

Examples

iex> UnicodeData.bidi_mirror_codepoint("[")
"]"
iex> UnicodeData.bidi_mirror_codepoint("A")
nil
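
A minimal sketch of character-based mirroring for a run that has already been resolved as right-to-left (rule L4 of UAX #9 applies mirroring only to such characters). `MirrorRun` is a hypothetical module, and deciding which runs are right-to-left is assumed to have happened elsewhere. The `mirror_of` argument is injectable only so the sketch can be tried with a stub.

```elixir
defmodule MirrorRun do
  # Hypothetical helper, not part of UnicodeData.
  # Replaces every character that has a mirroring counterpart and
  # leaves all other characters untouched.
  def mirror(run, mirror_of \\ &UnicodeData.bidi_mirror_codepoint/1) do
    run
    |> String.codepoints()
    |> Enum.map(fn cp -> mirror_of.(cp) || cp end)
    |> Enum.join()
  end
end
```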

bidi_mirrored?(codepoint)
bidi_mirrored?(integer | String.codepoint) :: boolean

The Bidi_Mirrored property indicates whether or not there is another Unicode character that typically has a glyph that is the mirror image of the original character’s glyph.

Character-based mirroring is used by the Unicode bidirectional algorithm. A layout engine may want to consider other methods of mirroring.

Some characters, like ∛ (CUBE ROOT), claim to be mirrored but do not actually have a corresponding mirror character. In those cases this function returns false.

This is sourced from BidiMirroring.txt

Examples

iex> UnicodeData.bidi_mirrored?("A")
false
iex> UnicodeData.bidi_mirrored?("[")
true
iex> UnicodeData.bidi_mirrored?("∛")
false

bidi_paired_bracket(codepoint)
bidi_paired_bracket(integer | String.codepoint) ::
  String.codepoint |
  nil

The Bidi_Paired_Bracket property is used to establish pairs of opening and closing brackets for the purposes of the Unicode bidirectional algorithm.

If a character is an opening or closing bracket, this will return the other character in the pair. Otherwise, it returns nil.

This is sourced from BidiBrackets.txt

Examples

iex> UnicodeData.bidi_paired_bracket("[")
"]"
iex> UnicodeData.bidi_paired_bracket("]")
"["
iex> UnicodeData.bidi_paired_bracket("A")
nil

bidi_paired_bracket_type(codepoint)
bidi_paired_bracket_type(integer | String.codepoint) :: String.t

The Unicode Bidi_Paired_Bracket_Type property classifies characters into opening and closing paired brackets for the purposes of the Unicode bidirectional algorithm.

It returns one of the following values:

  • o Open - The character is classified as an opening bracket.
  • c Close - The character is classified as a closing bracket.
  • n None - The character is not a paired bracket character.

This is sourced from BidiBrackets.txt

Examples

iex> UnicodeData.bidi_paired_bracket_type("[")
"o"
iex> UnicodeData.bidi_paired_bracket_type("}")
"c"
iex> UnicodeData.bidi_paired_bracket_type("A")
"n"
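
The two bracket properties are typically used together to locate matching pairs. The sketch below is loosely modeled on definition BD16 of UAX #9 and makes several simplifying assumptions: `BracketPairs` is a hypothetical module, the 63-entry stack limit and canonical equivalents are ignored, and no check is made that a bracket's bidirectional class is still neutral. The two lookup arguments are injectable only so the code can be exercised with stubs.

```elixir
defmodule BracketPairs do
  # Hypothetical helper, not part of UnicodeData.
  # Returns {open_index, close_index} tuples sorted by opening position.
  def find(
        text,
        type_of \\ &UnicodeData.bidi_paired_bracket_type/1,
        pair_of \\ &UnicodeData.bidi_paired_bracket/1
      ) do
    {_stack, pairs} =
      text
      |> String.codepoints()
      |> Enum.with_index()
      |> Enum.reduce({[], []}, fn {cp, index}, {stack, pairs} ->
        case type_of.(cp) do
          "o" ->
            # Remember which closing bracket would complete this opener.
            {[{pair_of.(cp), index} | stack], pairs}

          "c" ->
            case Enum.split_while(stack, fn {closer, _} -> closer != cp end) do
              # No opener is waiting for this closer: ignore it.
              {_, []} ->
                {stack, pairs}

              # Match found: openers stacked above it are discarded.
              {_skipped, [{_, open_index} | rest]} ->
                {rest, [{open_index, index} | pairs]}
            end

          _ ->
            {stack, pairs}
        end
      end)

    Enum.sort(pairs)
  end
end
```

Indexes are positions in the codepoint list, which keeps the result usable alongside per-codepoint data such as bidi_class/1 values.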

joining_group(codepoint)
joining_group(integer | String.codepoint) :: String.t

Determine the joining group for cursive scripts.

Characters from other scripts return No_Joining_Group as they do not participate in cursive shaping.

The ALAPH and DALATH RISH joining groups are of particular interest to shaping engines dealing with Syriac. Section 9.3 of the Unicode Standard discusses Syriac shaping in detail.

This is sourced from ArabicShaping.txt

Examples

iex> UnicodeData.joining_group("ك")
"KAF"
iex> UnicodeData.joining_group("د")
"DAL"
iex> UnicodeData.joining_group("ܐ")
"ALAPH"

joining_type(codepoint)
joining_type(integer | String.codepoint) :: String.t

Determine the joining type for cursive scripts.

Cursive scripts have the following join types:

  • R Right_Joining (top-joining for vertical)
  • L Left_Joining (bottom-joining for vertical)
  • D Dual_Joining
  • C Join_Causing
  • U Non_Joining
  • T Transparent

Characters from other scripts return U as they do not participate in cursive shaping.

This is sourced from ArabicShaping.txt

Examples

iex> UnicodeData.joining_type("ك")
"D"
iex> UnicodeData.joining_type("د")
"R"
iex> UnicodeData.joining_type("ܐ")
"R"
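
A shaping engine can derive a contextual form (isolated, initial, medial, final) from the joining types of a character and its neighbors. The sketch below is a simplified reading of the cursive joining rules and rests on assumptions: `JoiningForms` and its form atoms are hypothetical names, the text is taken to be in logical order for a right-to-left script, and script-specific rules such as those driven by joining_group/1 are ignored. The `joining_type` argument is injectable only so the code can be tried with a stub.

```elixir
defmodule JoiningForms do
  # Hypothetical helper, not part of UnicodeData.
  # Returns a list of {codepoint, form} tuples.
  def forms(text, joining_type \\ &UnicodeData.joining_type/1) do
    typed = for cp <- String.codepoints(text), do: {cp, joining_type.(cp)}

    typed
    |> Enum.with_index()
    |> Enum.map(fn {{cp, type}, index} ->
      prev_type = typed |> Enum.take(index) |> Enum.reverse() |> nearest_type()
      next_type = typed |> Enum.drop(index + 1) |> nearest_type()

      # R and D characters can connect to the preceding character,
      # D and L characters to the following one.
      joins_prev = type in ["R", "D"] and prev_type in ["D", "L", "C"]
      joins_next = type in ["D", "L"] and next_type in ["D", "R", "C"]

      {cp, form(joins_prev, joins_next)}
    end)
  end

  # Type of the nearest neighbor, skipping Transparent characters.
  # Falls back to "U" (Non_Joining) at the edge of the text.
  defp nearest_type(chars) do
    Enum.find_value(chars, "U", fn {_cp, type} -> if type != "T", do: type end)
  end

  defp form(true, true), do: :medi
  defp form(true, false), do: :fina
  defp form(false, true), do: :init
  defp form(false, false), do: :isol
end
```

Transparent, Non_Joining, and Join_Causing characters come out as :isol here, which a real shaper would treat as "no contextual form". Given the joining types shown in the examples above (D for ك, R for د), the string "كد" would resolve to an initial form followed by a final form.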

right_to_left?(script)
right_to_left?(String.t) :: boolean

Determine if the script is written right-to-left.

This data is derived from ISO 15924. There’s a handy sortable table on the Wikipedia page for ISO 15924.

Examples

iex> UnicodeData.right_to_left?("Latin")
false
iex> UnicodeData.right_to_left?("Arabic")
true

You can also pass the script short name.

iex> UnicodeData.right_to_left?("adlm")
true

script_from_codepoint(codepoint)
script_from_codepoint(integer | String.codepoint) :: String.t

Look up the script property associated with a codepoint.

This will return the script property value. In addition to the explicitly defined scripts, there are three special values.

  • Characters with script value Inherited inherit the script of the preceding character.
  • Characters with script value Common are used in multiple scripts.
  • Characters of Unknown script are unassigned, private use, noncharacter or surrogate code points.

This is sourced from Scripts.txt

Examples

iex> UnicodeData.script_from_codepoint("a")
"Latin"
iex> UnicodeData.script_from_codepoint("9")
"Common"
iex> UnicodeData.script_from_codepoint("ك")
"Arabic"
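
The special values matter when splitting text into script runs. The sketch below is one simple policy, not the only reasonable one: Common and Inherited characters join the run in progress, and a leading Common or Inherited run adopts the first explicit script that follows. `ScriptRuns` is a hypothetical module, and refinements such as Script_Extensions or paired punctuation handling are left out. The `lookup` argument is injectable only so the code can be tried with a stub.

```elixir
defmodule ScriptRuns do
  # Hypothetical helper, not part of UnicodeData.
  # Returns a list of {script, text} tuples in reading order.
  def split(text, lookup \\ &UnicodeData.script_from_codepoint/1) do
    text
    |> String.codepoints()
    |> Enum.reduce([], fn cp, runs ->
      case {lookup.(cp), runs} do
        # Common and Inherited characters join the run in progress.
        {script, [{current, chars} | rest]} when script in ["Common", "Inherited"] ->
          [{current, [cp | chars]} | rest]

        # Same script as the current run: extend it.
        {script, [{script, chars} | rest]} ->
          [{script, [cp | chars]} | rest]

        # A leading Common/Inherited run adopts the first explicit script.
        {script, [{current, chars} | rest]} when current in ["Common", "Inherited"] ->
          [{script, [cp | chars]} | rest]

        # Anything else starts a new run.
        {script, _} ->
          [{script, [cp]} | runs]
      end
    end)
    |> Enum.reverse()
    |> Enum.map(fn {script, chars} -> {script, chars |> Enum.reverse() |> Enum.join()} end)
  end
end
```

Each resulting script name can then be passed to script_to_tag/1 and right_to_left?/1.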

script_to_tag(script)
script_to_tag(String.t) :: String.t

Get the short name associated with a script. This is the tag used to identify scripts in OpenType fonts and generally matches the script code defined in ISO 15924.

See Unicode Standard Annex #24 for more about the relationship between Unicode and ISO 15924.

Data from OpenType script tags and PropertyValueAliases.txt

Examples

iex> UnicodeData.script_to_tag("Latin")
"latn"
iex> UnicodeData.script_to_tag("Unknown")
"zzzz"
iex> UnicodeData.script_to_tag("Adlam")
"adlm"

uses_joining_type?(script)
uses_joining_type?(String.t) :: boolean

Determine if a script uses the Joining Type property to select contextual forms.

Typically this is used to select a shaping engine, which will then call joining_type/1 and joining_group/1 to do cursive shaping.

Examples

iex> UnicodeData.uses_joining_type?("Latin")
false
iex> UnicodeData.uses_joining_type?("Arabic")
true
iex> UnicodeData.uses_joining_type?("Nko")
true

You can also pass the script short name.

iex> UnicodeData.uses_joining_type?("syrc")
true