Hanzi (hanyutils v0.3.0)
Han/Chinese character (汉字) utilities and conversion to Pinyin lists.
The main goal of this module is to convert strings containing Han characters into a
Pinyin.pinyin_list/0. In turn, such a list can be formatted by the functions present in the
Pinyin module (i.e. Pinyin.marked/1 or Pinyin.numbered/1).
A string of Han characters can be read with read/1 or sigil_h/2. These functions both return
a list containing strings mixed with Hanzi.t/0 structs. Such a list can be converted into
a Pinyin.pinyin_list/0 through the use of to_pinyin/2. Users with more esoteric use cases
can directly modify the Hanzi.t/0 structs inside the hanzi_list/0.
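The read, convert and format steps described above can be sketched as a small script. This is a minimal sketch, not an official example: it assumes Elixir 1.12 or later (for Mix.install/1), network access to fetch the package, and that the package is published on Hex as :hanyutils with a 0.3.x release.

```elixir
# Assumption: Elixir >= 1.12 and network access, so Mix.install/1 can fetch hanyutils.
Mix.install([{:hanyutils, "~> 0.3.0"}])

# Step 1: read a mixed string into a hanzi_list (plain strings and %Hanzi{} structs).
hanzi_list = Hanzi.read("hello, 你好")

# Step 2: convert the hanzi_list into a pinyin list using the default converter.
pinyin_list = Hanzi.to_pinyin(hanzi_list)

# Step 3: format the pinyin list, once with tone marks and once with tone numbers.
IO.puts(Pinyin.marked(pinyin_list))
IO.puts(Pinyin.numbered(pinyin_list))
```

Plain strings such as "hello, " pass through both steps untouched, so only the Han characters are replaced by their readings.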
to_pinyin/2 and converters
Since a given Hanzi may have different valid pronunciations, to_pinyin/2 accepts a second
argument that determines how a given Hanzi.t/0 is converted into a Pinyin.pinyin_list/0.
This second argument is called a converter. This module includes some standard converters.
Please refer to the documentation of to_pinyin/2 for more information.
Data source
The results of the functions offered by this module are all ultimately derived from the data
contained in the Unihan_Readings.txt file of the Unicode Han Database.
The documentation of t/0 describes which parts of this data are available.
Summary
Types
hanzi_list/0
List of Hanzi characters mixed with plain strings.
t/0
Representation of a single Hanzi (Chinese character).
Functions
all_pronunciations/4
Converter that returns all pronunciations of a character or the most common.
character?/1
Check if a single character is a valid Han character.
characters/2
Convert a Hanzi list to a string of characters.
characters?/1
Verify if a list or string contains only Han characters.
common_pronunciation/2
Converter that retrieves the most common pronunciation of a Hanzi.t/0.
from_character/1
Obtain the Hanzi.t/0 struct for a character.
list_pronunciations/1
Converter that returns a list of all known pronunciations of a character.
read/1
Read a string and convert it into a list of strings and Hanzi.t/0 structs.
sigil_h/2
Sigil to create a Hanzi list or struct.
to_pinyin/2
Convert a Hanzi list to a Pinyin list.
Types
List of Hanzi characters mixed with plain strings.
@type t() :: %Hanzi{ alt: [Pinyin.t()], char: String.t(), pron: Pinyin.t(), pron_tw: Pinyin.t() | nil }
Representation of a single Hanzi (Chinese character).
This struct contains all the information extracted from the Unihan_Readings database for a given character. It contains the following fields:
Key | Description | Unihan_Readings.txt field |
---|---|---|
char | The character itself | |
pron | Most common pronunciation in pinyin | kMandarin |
pron_tw | Most common pronunciation for Taiwan in Pinyin | kMandarin |
alt | All readings defined by the Hanyu Pinyin Dictionary | kHanyuPinyin |
pron and pron_tw
In some rare cases, the most common reading of a hanzi is different in mainland China and in
Taiwan. If this is the case, the most common mainland reading will be stored under the pron
key, while the most common Taiwanese reading will be stored under the pron_tw key. When the
readings are the same, pron will contain the reading while pron_tw will be nil. Note
that, at the time of writing, only 38 characters out of the 41226 defined by
Unihan_Readings.txt have a different reading for mainland China and Taiwan.
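Because pron_tw is nil whenever both regions share a reading, code that wants the Taiwanese reading has to fall back to pron. The sketch below illustrates that rule with a hypothetical helper, taiwan_reading, which is not part of the Hanzi module. It assumes Elixir 1.12 or later and that the package is available on Hex as :hanyutils.

```elixir
# Assumption: Elixir >= 1.12 and network access, so Mix.install/1 can fetch hanyutils.
Mix.install([{:hanyutils, "~> 0.3.0"}])

# Hypothetical helper (not part of Hanzi): use the Taiwanese reading when one is
# recorded, and fall back to the mainland reading when pron_tw is nil.
taiwan_reading = fn %Hanzi{pron: pron, pron_tw: pron_tw} -> pron_tw || pron end

# 万 has distinct mainland and Taiwanese readings, so pron_tw is set.
wan = Hanzi.from_character("万")
IO.inspect(taiwan_reading.(wan))

# 你 is read the same way in both regions, so pron_tw is nil and pron is used.
ni = Hanzi.from_character("你")
IO.inspect(taiwan_reading.(ni))
```

common_pronunciation/2 with :tw already covers this case when a converter is needed; the helper only makes the fallback on the struct fields explicit.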
alt
Some hanzi have different readings based on their exact use. When this is the case, all the
possible readings of a character will be stored as a list in alt.
Functions
all_pronunciations(hanzi, left \\ "[ ", mid \\ " | ", right \\ " ]")
@spec all_pronunciations(t(), String.t(), String.t(), String.t()) :: Pinyin.pinyin_list()
Converter that returns all pronunciations of a character or the most common.
If only a single pronunciation is available, it is returned; otherwise, all possible
pronunciations are returned. When all possible pronunciations are returned, left, mid and
right determine how the alternatives are separated: left is placed before the first
pronunciation in the list, right is placed after the last pronunciation, and mid is
placed between each pair of adjacent pronunciations.
Examples
iex> Hanzi.all_pronunciations(~h/你/s)
[%Pinyin{initial: "n", final: "i", tone: 3}]
iex> Hanzi.all_pronunciations(~h/㓎/s)
["[ ", %Pinyin{initial: "q", final: "in", tone: 1}, " | ", %Pinyin{initial: "q", final: "in", tone: 4}, " | ", %Pinyin{initial: "q", final: "in", tone: 3}, " ]"]
iex> Hanzi.all_pronunciations(~h/㓎/s, "", "", "")
["", %Pinyin{initial: "q", final: "in", tone: 1}, "", %Pinyin{initial: "q", final: "in", tone: 4}, "", %Pinyin{initial: "q", final: "in", tone: 3}, ""]
Check if a single character is a valid Han character.
Examples
iex> Hanzi.character?("你")
true
iex> Hanzi.character?("x")
false
iex> Hanzi.character?("你好")
false
@spec characters(hanzi_list(), String.t()) :: String.t()
Convert a Hanzi list to a string of characters.
This function extracts the character of each Hanzi.t/0 in lst. Normal strings in the list
are not modified. After converting the Hanzi in the list to characters, the list is joined with
Enum.join/2. The joiner argument is passed as the joiner to Enum.join/2.
Examples
iex> characters(~h/你好/)
"你好"
iex> characters(~h/你hello/)
"你hello"
iex> characters(~h/你好/, ";")
"你;好"
iex> characters(~h/你hello/, ";")
"你;hello"
Verify if a list or string contains only Han characters.
Note that whitespace is not counted as a character, so input that contains whitespace yields false.
Examples
iex> Hanzi.characters?(["你", "好"])
true
iex> Hanzi.characters?(["你", "boo", "好"])
false
iex> Hanzi.characters?("你好")
true
iex> Hanzi.characters?("你 好")
false
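Since whitespace makes the check fail, input that may contain spaces has to be cleaned before it is passed to characters?/1. The following is a minimal sketch of one way to do that, using String.replace/3 from the standard library. It assumes Elixir 1.12 or later and that the package is available on Hex as :hanyutils.

```elixir
# Assumption: Elixir >= 1.12 and network access, so Mix.install/1 can fetch hanyutils.
Mix.install([{:hanyutils, "~> 0.3.0"}])

input = "你 好"

# The space between the two characters makes the check fail on the raw input.
raw_check = Hanzi.characters?(input)

# Remove all Unicode whitespace, then run the same check on the cleaned string.
stripped = String.replace(input, ~r/\s+/u, "")
stripped_check = Hanzi.characters?(stripped)

IO.inspect({raw_check, stripped_check})
```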
@spec common_pronunciation(t(), :cn | :tw) :: Pinyin.pinyin_list()
Converter that retrieves the most common pronunciation of a Hanzi.t/0.
The additional argument specifies whether the most common pronunciation for mainland China (:cn, the default) or Taiwan (:tw) is retrieved.
Examples
iex> Hanzi.common_pronunciation(~h/你/s)
[%Pinyin{initial: "n", final: "i", tone: 3}]
iex> Hanzi.common_pronunciation(~h/你/s, :cn)
[%Pinyin{initial: "n", final: "i", tone: 3}]
iex> Hanzi.common_pronunciation(~h/你/s, :tw)
[%Pinyin{initial: "n", final: "i", tone: 3}]
iex> Hanzi.common_pronunciation(~h/万/s)
[%Pinyin{initial: "", final: "wan", tone: 4}]
iex> Hanzi.common_pronunciation(~h/万/s, :cn)
[%Pinyin{initial: "", final: "wan", tone: 4}]
iex> Hanzi.common_pronunciation(~h/万/s, :tw)
[%Pinyin{initial: "m", final: "o", tone: 4}]
Obtain the Hanzi.t/0 struct for a character.
Note that this only works on a single character; nil is returned when the argument is not exactly one Han character.
Examples
iex> Hanzi.from_character("你")
%Hanzi{char: "你", pron: %Pinyin{initial: "n", final: "i", tone: 3}, pron_tw: nil, alt: []}
iex> Hanzi.from_character("x")
nil
iex> Hanzi.from_character("你好")
nil
@spec list_pronunciations(t()) :: Pinyin.pinyin_list()
Converter that returns a list of all known pronunciations of a character.
This converter is similar to all_pronunciations/4, but it does not include separators around
the various pronunciations.
Examples
iex> Hanzi.list_pronunciations(~h/你/s)
[%Pinyin{initial: "n", final: "i", tone: 3}]
iex> Hanzi.list_pronunciations(~h/㓎/s)
[%Pinyin{initial: "q", final: "in", tone: 1}, %Pinyin{initial: "q", final: "in", tone: 4}, %Pinyin{initial: "q", final: "in", tone: 3}]
@spec read(String.t()) :: hanzi_list()
Read a string and convert it into a list of strings and Hanzi.t/0 structs.
This function reads a string containing Han characters mixed with normal text. The output of this function is a list of strings and Hanzi structs.
The input string may contain any character. Any character in the string that is recognised as a
Han character (by character?/1) is returned as a Hanzi.t/0 in the returned list. Any
other character in the input is returned unmodified.
Examples
iex> Hanzi.read("你好")
[%Hanzi{char: "你", pron: %Pinyin{initial: "n", final: "i", tone: 3}}, %Hanzi{char: "好", pron: %Pinyin{initial: "h", final: "ao", tone: 3}, alt: [%Pinyin{initial: "h", final: "ao", tone: 3}, %Pinyin{initial: "h", final: "ao", tone: 4}]}]
iex> Hanzi.read("hello, 你")
["hello, ", %Hanzi{char: "你", pron: %Pinyin{initial: "n", final: "i", tone: 3}}]
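Because the returned list mixes two kinds of elements, callers often need to tell them apart. The sketch below separates the Hanzi structs from the plain strings with Enum.split_with/2 and match?/2 from the standard library. It assumes Elixir 1.12 or later and that the package is available on Hex as :hanyutils.

```elixir
# Assumption: Elixir >= 1.12 and network access, so Mix.install/1 can fetch hanyutils.
Mix.install([{:hanyutils, "~> 0.3.0"}])

list = Hanzi.read("hello, 你好")

# Split the list: %Hanzi{} structs go to the first list, plain strings to the second.
{hanzi, strings} = Enum.split_with(list, &match?(%Hanzi{}, &1))

# The characters that were recognised as Han characters.
IO.inspect(Enum.map(hanzi, & &1.char))

# Everything else, returned unmodified by read/1.
IO.inspect(strings)
```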
Sigil to create a Hanzi list or struct.
When used without any modifiers, this sigil converts its input into a hanzi list through the use
of read/1. When the s modifier is used, from_character/1 is used instead.
Examples
iex> ~h/hello, 你/
["hello, ", %Hanzi{char: "你", pron: %Pinyin{initial: "n", final: "i", tone: 3}}]
iex> ~h/你/s
%Hanzi{char: "你", pron: %Pinyin{initial: "n", final: "i", tone: 3}}
iex> ~h/你好/s
nil
@spec to_pinyin(hanzi_list(), (t() -> Pinyin.pinyin_list())) :: Pinyin.pinyin_list()
Convert a Hanzi list to a Pinyin list.
Normal strings in the Hanzi list are returned unmodified. Every Hanzi.t/0 is passed as an
argument to converter, which returns a Pinyin.pinyin_list/0. This list is added to the
result.
After calling this function, Pinyin.marked/1 or Pinyin.numbered/1 can be used to format the
result.
Converters
A converter is any function that transforms a Hanzi.t/0 into a Pinyin.pinyin_list/0.
In the simplest case, such a converter returns the most common pronunciation. In more
complicated cases, such a converter returns all the possible pronunciations of a Hanzi,
separated by strings.
The Hanzi module includes three converters: common_pronunciation/2, all_pronunciations/4
and list_pronunciations/1.
If no converter is specified, &common_pronunciation(&1, :cn) is used.
If you wish to write your own converter, the functions mentioned above and the examples below should be a good starting point.
Examples
iex> to_pinyin(~h/你好/)
[%Pinyin{initial: "n", final: "i", tone: 3}, %Pinyin{initial: "h", final: "ao", tone: 3}]
iex> to_pinyin(~h/二万/, &common_pronunciation(&1, :tw))
[%Pinyin{initial: "", final: "er", tone: 4}, %Pinyin{initial: "m", final: "o", tone: 4}]
iex> to_pinyin(~h/你好/, &all_pronunciations/1)
[%Pinyin{initial: "n", final: "i", tone: 3}, "[ ", %Pinyin{initial: "h", final: "ao", tone: 3}, " | ", %Pinyin{initial: "h", final: "ao", tone: 4}, " ]"]
iex> to_pinyin(~h/你好/, &all_pronunciations(&1, "", "", ""))
[%Pinyin{initial: "n", final: "i", tone: 3}, "", %Pinyin{initial: "h", final: "ao", tone: 3}, "", %Pinyin{initial: "h", final: "ao", tone: 4}, ""]
iex> to_pinyin(~h/你好/, fn %Hanzi{pron: p} -> [p] end)
[%Pinyin{initial: "n", final: "i", tone: 3}, %Pinyin{initial: "h", final: "ao", tone: 3}]
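As a further illustration of a custom converter, the sketch below returns the most common reading and appends a "?" marker when a character has more than one reading in alt. The converter flag_ambiguous is hypothetical and not part of the Hanzi module. The script assumes Elixir 1.12 or later and that the package is available on Hex as :hanyutils.

```elixir
# Assumption: Elixir >= 1.12 and network access, so Mix.install/1 can fetch hanyutils.
Mix.install([{:hanyutils, "~> 0.3.0"}])

# Hypothetical converter (not part of Hanzi): return the most common reading and
# append "?" when alt lists more than one reading for the character.
flag_ambiguous = fn
  %Hanzi{pron: pron, alt: alt} when length(alt) > 1 -> [pron, "?"]
  %Hanzi{pron: pron} -> [pron]
end

# 你 has no alternative readings in alt, while 好 has two (hao3 and hao4).
result =
  "你好"
  |> Hanzi.read()
  |> Hanzi.to_pinyin(flag_ambiguous)

IO.inspect(result)
```

A converter may return strings next to Pinyin structs, which is how all_pronunciations/4 emits its separators, so the marker stays in the resulting pinyin list.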