Unicode.Set (Unicode Set v1.3.0)

Usage

Function guards

The macro Unicode.Set.match?/2 can be used to define function guards. For example:

defmodule Guards do
  require Unicode.Set

  # Define a guard that checks if a codepoint is a unicode digit
  defguard digit?(x) when Unicode.Set.match?(x, "[[:Nd:]]")
end

defmodule MyModule do
  require Guards

  # Define a function using the previously defined guard
  def my_function(<< x :: utf8, _rest :: binary>>) when Guards.digit?(x) do
    IO.puts "It's a digit!"
  end

  # Define a guard directly on the function
  def my_other_function(<< x :: utf8, _rest :: binary>>) when Unicode.Set.match?(x, "[[:Nd:]]") do
    IO.puts "It's also a digit!"
  end
end

Generating compiled patterns for String matching

String.split/3 and String.replace/3 accept both patterns and compiled patterns, with compiled patterns being the more performant approach. Unicode Set supports the generation of both:

iex> pattern = Unicode.Set.compile_pattern!("[[:digit:]]")
iex> list = String.split("abc1def2ghi3jkl", pattern)
["abc", "def", "ghi", "jkl"]
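The same compiled pattern can also be passed to String.replace/3. The following is a minimal sketch, assuming the library is published on Hex as the package unicode_set and is fetched with Mix.install/1 (Elixir 1.12 or later). The replacement string "#" is purely illustrative.

```elixir
# Assumption: the Hex package name is :unicode_set; adjust the version as needed.
Mix.install([{:unicode_set, "~> 1.3"}])

# Compile the Unicode Set once and reuse it across calls.
pattern = Unicode.Set.compile_pattern!("[[:digit:]]")

# Replace every digit with a placeholder character.
replaced = String.replace("abc1def2ghi3jkl", pattern, "#")
IO.inspect(replaced)

# The same pattern still works for splitting.
parts = String.split("abc1def2ghi3jkl", pattern)
IO.inspect(parts)
```

Binding the compiled pattern to a variable (or a module attribute) avoids recompiling the set on every call.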

Generating NimbleParsec ranges

The parser generator nimble_parsec allows a list of codepoint ranges as parameters to several combinators. Unicode Set can generate such ranges:

iex> Unicode.Set.to_utf8_char!("[[^abcd][mnb]]")
[98, 109..110, {:not, 97..100}]

This can be used as shown in the following example:

defmodule MyCombinators do
  import NimbleParsec

  # Module attributes are set without `=`
  @digit_list Unicode.Set.to_utf8_char!("[[:digit:]]")
  def unicode_digit do
    utf8_char(@digit_list)
    |> label("a digit in any Unicode script")
  end
end

Compiling extended regular expressions

The Regex module supports a limited set of Unicode Sets. The Unicode.Regex module provides compile/2 and compile!/2 functions that take the same arguments as, and are compatible with, Regex.compile/2, except that they pre-process the regular expression, expanding any Unicode Sets. This makes it simple to incorporate Unicode Sets in regular expressions.

All Unicode Sets are expanded, even those that are known to Regex.compile/2, since the Erlang :re module, upon which Regex is based, does not always keep pace with Unicode releases.

For example:

iex> Unicode.Regex.compile("\\p{Zs}")
{:ok, ~r/[\x{20}\x{A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]/u}

iex> Unicode.Regex.compile("[:graphic:]")
{:ok,
 ~r/[\x{20}-\x{7E}\x{A0}-\x{AC}\x{AE}-\x{377}\x{37A}-\x{37F}...]/u}
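To illustrate what the expansion buys, the sketch below compares the native \p{Zs} class with the expanded character class shown in the first example above, using only the standard Regex module. The sample strings are chosen for illustration; whether the two forms agree on every code point depends on the Unicode version known to the Erlang :re module.

```elixir
# Native property class, resolved by the regex engine's own Unicode tables.
native = ~r/\p{Zs}/u

# Expanded form, copied from the Unicode.Regex.compile/1 output shown above.
expanded = ~r/[\x{20}\x{A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}]/u

# A space, a no-break space, an ideographic space, a letter and a tab.
samples = [" ", "\u00A0", "\u3000", "a", "\t"]

for sample <- samples do
  IO.inspect({sample, Regex.match?(native, sample), Regex.match?(expanded, sample)})
end
```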

Other Examples

These examples show how to combine sets (union, difference and intersection) to flexibly target the required match.

# The character "๓" is the Thai digit `3`
iex> Unicode.Set.match? ?๓, "[[:digit:]]"
true

# Set operations allow union, intersection and difference
# This example matches on digits, but not the Thai script
iex> Unicode.Set.match? ?๓, "[[:digit:]-[:thai:]]"
false

Compile time parsing

As much work as possible is done at compile time in order to deliver good performance. The macro Unicode.Set.match?/2 parses the Unicode Set, expands the required codepoints and generates guard clauses at compile time. The resulting code is a simple set of boolean operators that executes quickly at runtime.
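The general technique can be sketched with the standard library alone. The macro below is not the library's implementation; it is a hypothetical illustration in which a hand-picked list of ranges (the module name RangeGuardSketch and the three ranges are assumptions) is turned into a chain of `in` checks joined by `or` at compile time, which is valid inside a guard.

```elixir
defmodule RangeGuardSketch do
  # Hypothetical, hand-picked subset of decimal digit ranges (ASCII,
  # Arabic-Indic, Extended Arabic-Indic). A real implementation would
  # derive the ranges from Unicode data.
  @ranges [48..57, 1632..1641, 1776..1785]

  # Expands at compile time into
  # `x in 48..57 or x in 1632..1641 or x in 1776..1785`.
  defmacro digit?(var) do
    @ranges
    |> Enum.map(fn range ->
      quote(do: unquote(var) in unquote(range.first)..unquote(range.last))
    end)
    |> Enum.reduce(fn clause, acc -> quote(do: unquote(acc) or unquote(clause)) end)
  end
end

defmodule RangeGuardDemo do
  require RangeGuardSketch

  # The macro is expanded inside the guard, so no set lookup happens at runtime.
  def starts_with_digit?(<<c::utf8, _rest::binary>>) when RangeGuardSketch.digit?(c), do: true
  def starts_with_digit?(_other), do: false
end

IO.inspect(RangeGuardDemo.starts_with_digit?("7 days"))
IO.inspect(RangeGuardDemo.starts_with_digit?("days"))
```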

Supported Unicode properties

This version of Unicode Set supports the following enumerable unicode properties in unicode sets:

  • script such as [:script=arabic:], \p{script=arabic} or [:arabic:]
  • block such as [:block=sundanese:], \p{block=sundanese}, \p{IsSundanese} or [:IsSundanese:]
  • general category such as [:Lu:], \p{Lu}, [:gc=Lu:] or [:general category=Lu:]
  • combining class such as [:ccc=230:]

In addition, the following boolean properties are supported. These are expressed as [:white space:] or \p{White Space}.

| Property | Property | Property | Property |
| --- | --- | --- | --- |
| alphabetic | ascii_hex_digit | bidi_control | cased |
| changes_when_casemapped | changes_when_lowercased | changes_when_titlecased | changes_when_uppercased |
| dash | default_ignorable_code_point | deprecated | diacritic |
| extender | grapheme_base | grapheme_extend | grapheme_link |
| hex_digit | hyphen | id_continue | id_start |
| ideographic | ids_binary_operator | ids_trinary_operator | join_control |
| logical_order_exception | lowercase | math | noncharacter_code_point |
| other_alphabetic | other_default_ignorable_code_point | other_grapheme_extend | other_id_continue |
| other_id_start | other_lowercase | other_math | other_uppercase |
| pattern_syntax | pattern_white_space | prepended_concatenation_mark | quotation_mark |
| radical | regional_indicator | sentence_terminal | soft_dotted |
| terminal_punctuation | unified_ideograph | uppercase | variation_selector |
| white_space | xid_continue | xid_start | changes_when_casefolded |

In all cases, property names and property values may include whitespace and mixed case notation.
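One way to picture this loose matching is as a normalization step applied before names are compared. The sketch below is an assumption about the general approach, not the library's code: it follows the usual Unicode loose-matching convention of ignoring case, whitespace, underscores and hyphens. The module and function names are hypothetical.

```elixir
defmodule LooseNameSketch do
  # Lowercase the name and drop whitespace, underscores and hyphens so that
  # differently written forms of the same name compare equal.
  def normalize(name) do
    name
    |> String.downcase()
    |> String.replace(~r/[\s_-]/u, "")
  end

  def same_name?(a, b), do: normalize(a) == normalize(b)
end

IO.inspect(LooseNameSketch.normalize("White Space"))
IO.inspect(LooseNameSketch.same_name?("general category", "General_Category"))
IO.inspect(LooseNameSketch.same_name?("Lu", "Ll"))
```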

General Categories

| Abbreviation | Long Form |
| --- | --- |
| L | Letter |
| Lu | Uppercase Letter |
| Ll | Lowercase Letter |
| Lt | Titlecase Letter |
| Lm | Modifier Letter |
| Lo | Other Letter |
| M | Mark |
| Mn | Non-Spacing Mark |
| Mc | Spacing Combining Mark |
| Me | Enclosing Mark |
| N | Number |
| Nd | Decimal Digit Number |
| Nl | Letter Number |
| No | Other Number |
| S | Symbol |
| Sm | Math Symbol |
| Sc | Currency Symbol |
| Sk | Modifier Symbol |
| So | Other Symbol |
| P | Punctuation |
| Pc | Connector Punctuation |
| Pd | Dash Punctuation |
| Ps | Open Punctuation |
| Pe | Close Punctuation |
| Pi | Initial Punctuation |
| Pf | Final Punctuation |
| Po | Other Punctuation |
| Z | Separator |
| Zs | Space Separator |
| Zl | Line Separator |
| Zp | Paragraph Separator |
| C | Other |
| Cc | Control |
| Cf | Format |
| Cs | Surrogate |
| Co | Private Use |
| Cn | Unassigned |

| Derived Categories | Long Form | Description |
| --- | --- | --- |
| Any | Any | all code points [\u{0}-\u{10FFFF}] |
| Assigned | Assigned | all assigned characters, meaning \P{Cn} |
| ASCII | ASCII | all ASCII characters [\u{0}-\u{7F}] |

Compatibility Property Names

| Property | Unicode Category | Comments |
| --- | --- | --- |
| alpha | \p{Alphabetic} | Alphabetic includes more than gc = Letter. Note that combining marks (Me, Mn, Mc) are required for words of many languages. While they could be applied to non-alphabetics, their principal use is on alphabetics. Alphabetic should not be used as an approximation for word boundaries: see word below. |
| lower | \p{Lowercase} | Lowercase includes more than gc = Lowercase_Letter (Ll). |
| upper | \p{Uppercase} | Uppercase includes more than gc = Uppercase_Letter (Lu). |
| punct | \p{gc=Punctuation} \p{gc=Symbol} - \p{alpha} | Punctuation and symbols. |
| digit | \p{gc=Decimal_Number} | [0..9] Non-decimal numbers (like Roman numerals) are normally excluded. |
| xdigit | \p{gc=Decimal_Number} \p{Hex_Digit} | [0-9 A-F a-f] Hex_Digit contains 0-9 A-F, fullwidth and halfwidth, upper and lowercase. |
| alnum | \p{alpha} \p{digit} | Simple combination of other properties |
| space | \p{Whitespace} | |
| blank | \p{gc=Space_Separator} \N{CHARACTER TABULATION} | "horizontal" whitespace: space separators plus U+0009 tab. |
| cntrl | \p{gc=Control} | The characters in \p{gc=Format} share some, but not all aspects of control characters. Many format characters are required in the representation of plain text. |
| graph | [^\p{space} \p{gc=Control} \p{gc=Surrogate} \p{gc=Unassigned}] | Warning: the set shown here is defined by excluding space, controls, and so on with ^. |
| print | \p{graph} \p{blank} -- \p{cntrl} | Includes graph and space-like characters. |
| word | \p{alpha} \p{gc=Mark} \p{digit} \p{gc=Connector_Punctuation} \p{Join_Control} | This is only an approximation to Word Boundaries. The Connector Punctuation is added in for programming language identifiers, thus adding _ and similar characters. |

Additional Derived properties

In addition to the Unicode properties, some additional properties are also defined for convenience. These properties relate to quote marks and are:

  • quote_mark
  • quote_mark_left
  • quote_mark_right
  • quote_mark_ambidextrous
  • quote_mark_single
  • quote_mark_double

As above these properties can be expressed in mixed case with spaces and underscores inserted for readability. They can be used in the same way as any Unicode property name.

Example Unicode Sets

Here are a few examples of sets. Although elements of the syntax appear similar to regular expressions, a Unicode Set only expresses one or more ranges of Unicode codepoints.

| Pattern | Description |
| --- | --- |
| [a-z] | The lower case letters a through z |
| [abc123] | The six characters a, b, c, 1, 2 and 3 |
| [\p{Letter}] | All characters with the Unicode General Category of Letter |

String Values

In addition to being a set of characters (of Unicode code points), a UnicodeSet may also contain string values. Conceptually, the UnicodeSet is always a set of strings, not a set of characters, although in many common use cases the strings are all of length one, which reduces to being a set of characters.

This concept can be confusing when first encountered, probably because similar set constructs from other environments (regular expressions) can only contain characters.

Unicode Set Patterns

Patterns are a series of characters bounded by square brackets that contain lists of characters and Unicode property sets. Lists are a sequence of characters that may have ranges indicated by a '-' between two characters, as in "a-z". The sequence specifies the range of all characters from the left to the right, in Unicode order. For example, [a c d-f m] is equivalent to [a c d e f m]. Whitespace can be freely used for clarity as [a c d-f m] means the same as [acd-fm].
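The list rules described above (whitespace is ignored, and `x-y` denotes every character from x to y in Unicode order) can be sketched in a few lines. This is a simplified, hypothetical expander for plain character lists only; it does not handle properties, quoting, escapes or nested sets, and it assumes the left end of a range is not greater than the right end.

```elixir
defmodule SimpleListSketch do
  # Expands the inside of a simple list such as "a c d-f m" into a sorted
  # charlist of the code points it denotes.
  def expand(list) do
    list
    |> String.replace(~r/\s/u, "")
    |> String.to_charlist()
    |> do_expand([])
    |> Enum.sort()
    |> Enum.dedup()
  end

  # A character followed by `-` and another character is a range.
  defp do_expand([first, ?-, last | rest], acc),
    do: do_expand(rest, Enum.to_list(first..last) ++ acc)

  # Any other character stands for itself.
  defp do_expand([char | rest], acc), do: do_expand(rest, [char | acc])
  defp do_expand([], acc), do: acc
end

IO.inspect(SimpleListSketch.expand("a c d-f m") |> List.to_string())
IO.inspect(SimpleListSketch.expand("a c d-f m") == SimpleListSketch.expand("acd-fm"))
```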

Unicode property sets are specified by a Unicode property, such as [:Letter:]. For a list of supported properties, see the Properties section. For details on the use of short vs. long property and property value names, see the end of this section. The syntax for specifying the property names is an extension of either POSIX or Perl syntax with the addition of =value. For example, you can match letters by using the POSIX syntax [:Letter:], or by using the Perl-style syntax \p{Letter}. The type can be omitted for the Category and Script properties, but is required for other properties.

The table below shows the two kinds of syntax: POSIX and Perl style. Also, the table shows the "Negative", which is a property that excludes all characters of a given kind. For example, [:^Letter:] matches all characters that are not [:Letter:].

| Style | Positive | Negative |
| --- | --- | --- |
| POSIX-style Syntax | [:type=value:] | [:^type=value:] |
| Perl-style Syntax | \p{type=value} | \P{type=value} |

These low-level lists or properties can then be freely combined with the normal set operations (union, inverse, difference, and intersection):

| Operation | Example | Meaning |
| --- | --- | --- |
| A B | [[:letter:] [:number:]] | To union two sets A and B, simply concatenate them |
| A & B | [[:letter:] & [a-z]] | To intersect two sets A and B, use the '&' operator. |
| A - B | [[:letter:] - [a-z]] | To take the set-difference of two sets A and B, use the '-' operator. |
| [^A] | [^a-z] | To invert a set A, place a ^ immediately after the opening [. Note that the complement only affects code points, not string values. In any other location, the ^ does not have a special meaning. |

Precedence

The binary operators of union, intersection, and set-difference have equal precedence and bind left-to-right. Thus the following are equivalent:

  • [[:letter:] - [a-z] [:number:] & [\u0100-\u01FF]]
  • [[[[:letter:] - [a-z]] [:number:]] & [\u0100-\u01FF]]

Another example is that the set [[ace][bdf] - [abc][def]] is not the empty set, but instead the set [def]. That is because the syntax corresponds to the following UnicodeSet operations:

  1. start with [ace]
  2. union [bdf] -- we now have [abcdef]
  3. subtract [abc] -- we now have [def]
  4. union [def] -- no effect, we still have [def]

This only really matters where there are the difference and intersection operations, as the union operation is commutative. To make sure that the - is the main operator, add brackets to group the operations as desired, such as [[ace][bdf] - [[abc][def]]].
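The left-to-right evaluation described above can be reproduced with MapSet from the standard library. This is only a model of the set algebra, not a call into the library; the helper `set` is a hypothetical convenience for building a set of code points from a string.

```elixir
# Hypothetical helper: the set of code points in a string.
set = fn string -> string |> String.to_charlist() |> MapSet.new() end

# [[ace][bdf] - [abc][def]] applied strictly left to right.
left_to_right =
  set.("ace")
  |> MapSet.union(set.("bdf"))
  |> MapSet.difference(set.("abc"))
  |> MapSet.union(set.("def"))

# [[ace][bdf] - [[abc][def]]] with the right-hand side grouped first.
grouped =
  MapSet.difference(
    MapSet.union(set.("ace"), set.("bdf")),
    MapSet.union(set.("abc"), set.("def"))
  )

IO.inspect(left_to_right |> Enum.sort() |> List.to_string())
IO.inspect(MapSet.size(grouped))
```

Without grouping the result is the set [def]; with grouping it is the empty set.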

Another caveat with the & and - operators is that they operate between sets. That is, they must be immediately preceded and immediately followed by a set. For example, the pattern [[:Lu:]-A] is illegal, since it is interpreted as the set [:Lu:] followed by the incomplete range -A. To specify the set of uppercase letters except for A, enclose the A in a set: [[:Lu:]-[A]].

Examples

  • [a] The set containing 'a'
  • [a-z] The set containing 'a' through 'z' and all letters in between, in Unicode order
  • [^a-z] The set containing all characters but 'a' through 'z', that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF
  • [[pat1][pat2]] The union of sets specified by pat1 and pat2
  • [[pat1]&[pat2]] The intersection of sets specified by pat1 and pat2
  • [[pat1]-[pat2]] The asymmetric difference of sets specified by pat1 and pat2
  • [:Lu:] The set of characters belonging to the given Unicode category; in this case, Unicode uppercase letters. The long form for this is [:UppercaseLetter:].
  • [:L:] The set of characters belonging to all Unicode categories starting with 'L', that is, [[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]]. The long form for this is [:Letter:].

String Values in Sets

String values are enclosed in {curly brackets}.

| Set expression | Description |
| --- | --- |
| [abc{def}] | A set containing four members, the single characters a, b and c, and the string “def” |
| [{abc}{def}] | A set containing two members, the string “abc” and the string “def”. |
| [{a}{b}{c}] and [abc] | These two sets are equivalent. Each contains three items, the three individual characters a, b and c. A {string} containing a single character is equivalent to that same character specified in any other way. |
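The "set of strings" view can be modelled with MapSet, treating every member as a string regardless of its length. This is a conceptual sketch only and does not call the library; the variable names are illustrative.

```elixir
# [abc{def}]: three single-character members plus the string "def".
with_string = MapSet.new(["a", "b", "c", "def"])

# [{abc}{def}]: two multi-character members.
two_strings = MapSet.new(["abc", "def"])

# [{a}{b}{c}] and [abc] describe the same three members.
braces = MapSet.new(["a", "b", "c"])
plain = "abc" |> String.graphemes() |> MapSet.new()

IO.inspect(MapSet.size(with_string))
IO.inspect(MapSet.size(two_strings))
IO.inspect(MapSet.equal?(braces, plain))
```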

Character Quoting and Escaping in Unicode Set Patterns

Single Quote

Two single quotes represent a single quote, either inside or outside single quotes.

Text within single quotes is not interpreted in any way (except for two adjacent single quotes). It is taken as literal text (special characters become non-special).

These quoting conventions for ICU UnicodeSets differ from those of regular expression character set expressions. In regular expressions, single quotes have no special meaning and are treated like any other literal character.

Backslash Escapes

Outside of single quotes, certain backslashed characters have special meaning. Note that these are escapes processed by Unicode Set (this library) and therefore require \\\\ to be entered as a prefix. Elixir also provides similar escapes as a native part of its string processing, and Elixir's escapes are to be preferred where possible.

| Escape | Description |
| --- | --- |
| \uhhhh | Exactly 4 hex digits; h in [0-9A-Fa-f] |
| \Uhhhhhhhh | Exactly 8 hex digits |
| \xhh | 1-2 hex digits |

Certain other escapes are native to Elixir and are applicable in Unicode Sets as they are in any Elixir string:

| Escape | Description |
| --- | --- |
| \a | U+0007 (BELL) |
| \b | U+0008 (BACKSPACE) |
| \t | U+0009 (HORIZONTAL TAB) |
| \n | U+000A (LINE FEED) |
| \v | U+000B (VERTICAL TAB) |
| \f | U+000C (FORM FEED) |
| \r | U+000D (CARRIAGE RETURN) |
| \\ | U+005C (BACKSLASH) |
| \xDD | represents a single byte in hexadecimal (such as \x13) |
| \uDDDD and \u{D...} | represents a Unicode codepoint in hexadecimal (such as \u{1F600}) |

Anything else following a backslash is mapped to itself, except in an environment where it is defined to have some special meaning. For example, \p{Lu} is the set of uppercase letters in a Unicode Set.

Any character formed as the result of a backslash escape loses any special meaning and is treated as a literal. In particular, note that \u and \U escapes create literal characters.
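The difference between an escape resolved by Elixir and one left for the library to process is visible in plain strings, without calling the library at all. The sketch below uses only standard Elixir string syntax; the variable names are illustrative.

```elixir
# Elixir resolves the escapes itself, so the string already contains "a" and "z".
resolved_by_elixir = "[\u0061-\u007A]"

# Doubling the backslash keeps the escape sequence intact in the string ...
passed_through = "[\\u0061-\\u007A]"

# ... as does the ~S sigil, which performs no escape processing.
passed_through_sigil = ~S"[\u0061-\u007A]"

IO.inspect(resolved_by_elixir)
IO.inspect(passed_through == passed_through_sigil)
IO.inspect(String.length(passed_through))
```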

Whitespace

Whitespace (as defined by the specification) is ignored unless it is quoted or backslashed.

Property Values

The following property value variants are recognized:

| Format | Example | Description |
| --- | --- | --- |
| short | Lu | omits the type (used to prevent ambiguity and only allowed with the Category and Script properties) |
| medium | gc=Lu | uses an abbreviated type and value |
| long | General_Category=Uppercase_Letter | uses a full type and value |

If the type or value is omitted, then the equals sign is also omitted. The short style is only used for Category and Script properties because these properties are very common and their omission is unambiguous.

In actual practice, you can mix type names and values that are omitted, abbreviated, or full. For example, if Category=Unassigned you could use what is in the table explicitly, \p{gc=Unassigned}, \p{Category=Cn}, or \p{Unassigned}.

When these are processed, case and whitespace are ignored so you may use them for clarity, if desired. For example, \p{Category = Uppercase Letter} or \p{Category = uppercase letter}.

Summary

Functions

compile_pattern(unicode_set)

Transforms a Unicode Set into a compiled pattern that can be used with String.split/3 and String.replace/3.

compile_pattern!(unicode_set)

Transforms a Unicode Set into a compiled pattern that can be used with String.split/3 and String.replace/3. Raises an exception on error.

match?(var, unicode_set)

Returns a boolean based upon whether var matches the provided unicode_set.

parse_and_reduce(unicode_set)

Parses a unicode set and expands the set expressions then compacts the character ranges.

to_pattern(unicode_set)

Transforms a Unicode Set into a pattern that can be used with String.split/3 and String.replace/3.

to_pattern!(unicode_set)

Transforms a Unicode Set into a pattern that can be used with String.split/3 and String.replace/3.

to_regex_string(unicode_set)

Transforms a Unicode Set into a regex string that can be used as an argument to Regex.compile/1.

to_regex_string!(unicode_set)

Transforms a Unicode Set into a regex string that can be used as an argument to Regex.compile/1.

to_utf8_char(unicode_set)

Transforms a Unicode Set into a list of codepoints that can be used with nimble_parsec.

to_utf8_char!(unicode_set)

Transforms a Unicode Set into a list of codepoints that can be used with nimble_parsec.

Types

@type character_range() :: {codepoint(), codepoint()}
@type codepoint() :: 0..1_114_111
@type codepoint_range() :: %Range{first: codepoint(), last: codepoint(), step: term()}
@type generated_match() :: [Macro.t() | String.t()]
@type nimble_list() :: [nimble_range()]
@type nimble_range() ::
  codepoint() | codepoint_range() | {:not, codepoint() | codepoint_range()}
@type operation() ::
  [{operator(), operation() | range_list()}]
  | {operator(), operation() | range_list()}
@type operator() :: :union | :intersection | :difference | :in | :not_in
@type range() :: character_range() | string_range()
@type range_list() :: [range()]
@type state() :: nil | :parsed | :reduced | :expanded
@type string_range() :: {charlist(), charlist()}
@type t() :: %Unicode.Set{
  parsed: operation() | range_list(),
  set: binary(),
  state: state()
}

Functions

compile_pattern(unicode_set)
@spec compile_pattern(binary()) :: {:ok, [binary()]} | {:error, {module(), binary()}}

Transforms a Unicode Set into a compiled pattern that can be used with String.split/3 and String.replace/3.

Compiled patterns can be more performant when matching strings.

Arguments

  • unicode_set is a string representation of a Unicode Set

Returns

  • {:ok, compiled_pattern} or

  • {:error, {exception, reason}}

Example

iex> {:ok, pattern} = Unicode.Set.compile_pattern("[[:digit:]]")
{:ok, {:ac, #Reference<0.2927979228.2367029250.255911>}}
iex> String.split("abc1def2ghi3jkl", pattern)
["abc", "def", "ghi", "jkl"]
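When the set comes from outside the program, both return shapes listed above need handling. The following is a minimal sketch, assuming the library is fetched from Hex as unicode_set via Mix.install/1; the module SafeSplit and its function split_on/2 are hypothetical names introduced for illustration.

```elixir
# Assumption: the Hex package name is :unicode_set; adjust the version as needed.
Mix.install([{:unicode_set, "~> 1.3"}])

defmodule SafeSplit do
  # Splits `string` on any character in `unicode_set`, returning a tagged
  # error with a readable message if the set cannot be compiled.
  def split_on(string, unicode_set) do
    case Unicode.Set.compile_pattern(unicode_set) do
      {:ok, pattern} ->
        {:ok, String.split(string, pattern)}

      {:error, {exception, reason}} ->
        {:error, "#{inspect(exception)}: #{reason}"}
    end
  end
end

IO.inspect(SafeSplit.split_on("abc1def2ghi3jkl", "[[:digit:]]"))
```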

compile_pattern!(unicode_set)

(since 1.3.0)
@spec compile_pattern!(binary()) :: [binary()] | no_return()

Transforms a Unicode Set into a compiled pattern that can be used with String.split/3 and String.replace/3. Raises an exception on error.

Compiled patterns can be more performant when matching strings.

Arguments

  • unicode_set is a string representation of a Unicode Set

Returns

  • compiled_pattern or

  • raises an exception.

Example

iex> pattern = Unicode.Set.compile_pattern!("[[:digit:]]")
{:ac, #Reference<0.2927979228.2367029250.255911>}
iex> String.split("abc1def2ghi3jkl", pattern)
["abc", "def", "ghi", "jkl"]

match?(var, unicode_set)

(macro)

Returns a boolean based upon whether var matches the provided unicode_set.

Arguments

  • var is any integer variable (since codepoints are integers)

  • unicode_set is a binary representation of a unicode set. An exception will be raised if unicode_set is not a compile time binary

Returns

  • true or false

Examples

  • As a guard definition:
defguard is_upper(codepoint) when Unicode.Set.match?(codepoint, "[[:Lu:]]")
  • Or as a guard clause itself:
def my_function(<< codepoint :: utf8, _rest :: binary>>)
  when Unicode.Set.match?(codepoint, "[[:Lu:]]") do
  # The first codepoint is an uppercase letter
  codepoint
end

@spec parse(binary()) :: {:ok, t()} | {:error, {module(), binary()}}

@spec parse!(binary()) :: t() | no_return()

parse_and_reduce(unicode_set)
@spec parse_and_reduce(binary()) :: {:ok, t()} | {:error, {module(), binary()}}

Parses a unicode set and expands the set expressions then compacts the character ranges.

parse_and_reduce!(unicode_set)
@spec parse_and_reduce!(binary()) :: t() | no_return()

to_pattern(unicode_set)
@spec to_pattern(binary()) :: {:ok, [binary()]} | {:error, {module(), binary()}}

Transforms a Unicode Set into a pattern that can be used with String.split/3 and String.replace/3.

Arguments

  • unicode_set is a string representation of a Unicode Set

Returns

  • {:ok, pattern} or

  • {:error, {exception, reason}}

Example

iex> pattern = Unicode.Set.to_pattern "[[:digit:]]"
{:ok,
 ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "٠", "١", "٢", "٣",
  "٤", "٥", "٦", "٧", "٨", "٩", "۰", "۱", "۲", "۳", "۴", "۵", "۶",
  "۷", "۸", "۹", "߀", "߁", "߂", "߃", "߄", "߅", "߆", "߇", "߈", "߉",
  "०", "१", "२", "३", "४", "५", "६", "७", ...]}

to_pattern!(unicode_set)
@spec to_pattern!(binary()) :: [binary()] | no_return()

Transforms a Unicode Set into a pattern that can be used with String.split/3 and String.replace/3.

Arguments

  • unicode_set is a string representation of a Unicode Set

Returns

  • pattern or

  • raises an exception

Example

iex> pattern = Unicode.Set.to_pattern! "[[:digit:]]"
["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "٠", "١", "٢", "٣",
 "٤", "٥", "٦", "٧", "٨", "٩", "۰", "۱", "۲", "۳", "۴", "۵", "۶",
 "۷", "۸", "۹", "߀", "߁", "߂", "߃", "߄", "߅", "߆", "߇", "߈", "߉",
 "०", "१", "२", "३", "४", "५", "६", "७", ...]

to_regex_string(unicode_set)
@spec to_regex_string(binary()) :: {:ok, binary()} | {:error, {module(), binary()}}

Transforms a Unicode Set into a regex string that can be used as an argument to Regex.compile/1.

Arguments

  • unicode_set is a string representation of a Unicode Set

Returns

  • {:ok, regex_string} or

  • {:error, {exception, reason}}

Example

iex> Unicode.Set.to_regex_string "[[:Zs]-[ ]]"
{:ok, "[\x{3A}\x{5A}\x{73}]"}

to_regex_string!(unicode_set)
@spec to_regex_string!(binary()) :: binary() | no_return()

Transforms a Unicode Set into a regex string that can be used as an argument to Regex.compile/1.

Arguments

  • unicode_set is a string representation of a Unicode Set

Returns

  • regex_string or

  • raises an exception

Example

iex> Unicode.Set.to_regex_string! "[[:Zs]-[ ]]"
"[\x{3A}\x{5A}\x{73}]"

to_utf8_char(unicode_set)
@spec to_utf8_char(binary()) :: {:ok, nimble_list()} | {:error, {module(), binary()}}

Transforms a Unicode Set into a list of codepoints that can be used with nimble_parsec.

The list of codepoints can be used as an argument to NimbleParsec.utf8_char/1.

Arguments

  • unicode_set is a string representation of a Unicode Set

Returns

  • {:ok, list_of_codepoints} or

  • {:error, {exception, reason}}

Example

iex> pattern = Unicode.Set.to_utf8_char "[[:digit:]-[:Zs]]"
{:ok,
 [48..57, 1632..1641, 1776..1785, 1984..1993, 2406..2415, 2534..2543,
  2662..2671, 2790..2799, 2918..2927, 3046..3055, 3174..3183, 3302..3311,
  3430..3439, 3558..3567, 3664..3673, 3792..3801, 3872..3881, 4160..4169,
  4240..4249, 6112..6121, 6160..6169, 6470..6479, 6608..6617, 6784..6793,
  6800..6809, 6992..7001, 7088..7097, 7232..7241, 7248..7257, 42528..42537,
  43216..43225, 43264..43273, 43472..43481, 43504..43513, 43600..43609,
  44016..44025, 65296..65305, 66720..66729, 68912..68921, 69734..69743,
  69872..69881, 69942..69951, 70096..70105, 70384..70393, 70736..70745,
  70864..70873, 71248..71257, 71360..71369, ...]}

to_utf8_char!(unicode_set)
@spec to_utf8_char!(binary()) :: nimble_list() | no_return()

Transforms a Unicode Set into a list of codepoints that can be used with nimble_parsec.

The list of codepoints can be used as an argument to NimbleParsec.utf8_char/1.

Arguments

  • unicode_set is a string representation of a Unicode Set

Returns

  • list_of_codepoints or

  • raises an exception

Example

iex> pattern = Unicode.Set.to_utf8_char! "[[:digit:]-[:Zs]]"
[48..57, 1632..1641, 1776..1785, 1984..1993, 2406..2415, 2534..2543,
 2662..2671, 2790..2799, 2918..2927, 3046..3055, 3174..3183, 3302..3311,
 3430..3439, 3558..3567, 3664..3673, 3792..3801, 3872..3881, 4160..4169,
 4240..4249, 6112..6121, 6160..6169, 6470..6479, 6608..6617, 6784..6793,
 6800..6809, 6992..7001, 7088..7097, 7232..7241, 7248..7257, 42528..42537,
 43216..43225, 43264..43273, 43472..43481, 43504..43513, 43600..43609,
 44016..44025, 65296..65305, 66720..66729, 68912..68921, 69734..69743,
 69872..69881, 69942..69951, 70096..70105, 70384..70393, 70736..70745,
 70864..70873, 71248..71257, 71360..71369, ...]