Hanzi (hanyutils v0.3.0)

Han/Chinese character (汉字) utilities and conversion to Pinyin lists.

The main goal of this module is to convert strings containing Han characters into a Pinyin.pinyin_list/0. In turn, such a list can be formatted by the functions in the Pinyin module, such as Pinyin.marked/1 or Pinyin.numbered/1.

A string of Han characters can be read with read/1 or sigil_h/2. Both functions return a list containing strings mixed with Hanzi.t/0 structs. Such a list can be converted into a Pinyin.pinyin_list/0 with to_pinyin/2. Users with more esoteric use cases can directly modify the Hanzi.t/0 structs inside the hanzi_list/0.
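The flow described above (read, convert, format) can be sketched as a small script. This is a minimal sketch, not the library's canonical usage: it assumes the package is available on Hex as `:hanyutils` with a `~> 0.3.0` requirement, and the override of 好 to its fourth-tone reading is only an illustration of modifying a struct in the list.

```elixir
# Assumption: the package is published on Hex as :hanyutils; the version
# requirement simply mirrors the version this page documents.
Mix.install([{:hanyutils, "~> 0.3.0"}])

# 1. Read: Han characters become Hanzi structs, other text stays a string.
hanzi_list = Hanzi.read("你好")

# 2. Convert: the default converter picks the most common mainland reading.
pinyin_list = Hanzi.to_pinyin(hanzi_list)

# 3. Format: print the result with tone marks and with tone numbers.
IO.puts(Pinyin.marked(pinyin_list))
IO.puts(Pinyin.numbered(pinyin_list))

# Direct modification (illustrative): replace the main reading of 好 with its
# fourth-tone alternative, if one is listed in alt. Plain strings in the list
# do not match the map pattern and pass through untouched.
patched =
  Enum.map(Hanzi.read("好"), fn
    %{char: "好", alt: alt} = hanzi ->
      %{hanzi | pron: Enum.find(alt, hanzi.pron, &(&1.tone == 4))}

    other ->
      other
  end)

patched_pinyin = Hanzi.to_pinyin(patched)
IO.puts(Pinyin.numbered(patched_pinyin))
```

Only remote function calls are used at the top level here, so the script does not rely on the `~h` sigil being imported before the dependency is installed.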

to_pinyin/2 and converters

Since a given Hanzi may have different valid pronunciations, to_pinyin/2 accepts a second argument that determines how a given Hanzi.t/0 is converted into a Pinyin.pinyin_list/0. This second argument is called a converter. This module includes some standard converters. Please refer to the documentation of to_pinyin/2 for more information.

Data source

All results returned by this module are ultimately derived from the Unihan_Readings.txt file of the Unicode Han Database. The documentation of t/0 describes which fields of that file are used.

Summary

Types

hanzi_list()

List of Hanzi characters mixed with plain strings.

t()

Representation of a single Hanzi (Chinese character).

Functions

all_pronunciations/4

Converter that returns all pronunciations of a character or the most common.

character?/1

Check if a single character is a valid Han character.

characters/2

Convert a Hanzi list to a string of characters.

characters?/1

Verify if a list or string contains only Han characters.

common_pronunciation/2

Converter that retrieves the most common pronunciation of a Hanzi.t/0.

from_character/1

Obtain the Hanzi.t/0 struct for a character.

list_pronunciations/1

Converter that returns a list of all known pronunciations of a character.

read/1

Read a string and convert it into a list of strings and Hanzi.t/0 structs.

sigil_h/2

Sigil to create a Hanzi list or struct.

to_pinyin/2

Convert a Hanzi list to a Pinyin list.

Types

@type hanzi_list() :: [t() | String.t()]

List of Hanzi characters mixed with plain strings.

@type t() :: %Hanzi{
  alt: [Pinyin.t()],
  char: String.t(),
  pron: Pinyin.t(),
  pron_tw: Pinyin.t() | nil
}

Representation of a single Hanzi (Chinese character).

This struct contains all the information extracted from the Unihan_Readings database for a given character. It contains the following fields:

| Key | Description | Unihan_Readings.txt field |
| --- | --- | --- |
| char | The character itself | |
| pron | Most common pronunciation in Pinyin | kMandarin |
| pron_tw | Most common pronunciation for Taiwan in Pinyin | kMandarin |
| alt | All readings defined by the Hanyu Pinyin Dictionary | kHanyuPinyin |

pron and pron_tw

In some rare cases, the most common reading of a hanzi is different in mainland China and in Taiwan. If this is the case, the most common mainland reading will be stored under the pron key, while the most common Taiwanese reading will be stored under the pron_tw key. When the readings are the same, pron will contain the reading while pron_tw will be nil. Note that, at the time of writing, only 38 characters out of the 41226 defined by Unihan_Readings.txt have a different reading for mainland China and Taiwan.

alt

Some hanzi have different readings based on their exact use. When this is the case, all the possible readings of the character are stored as a list in alt.
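To make the three pronunciation fields concrete, the following sketch inspects a few structs. It is a minimal illustration that assumes the package can be fetched from Hex as `:hanyutils`; the characters are the ones used in the examples on this page.

```elixir
# Assumption: the package is published on Hex as :hanyutils; the version
# requirement simply mirrors the version this page documents.
Mix.install([{:hanyutils, "~> 0.3.0"}])

ni = Hanzi.from_character("你")
wan = Hanzi.from_character("万")
hao = Hanzi.from_character("好")

# 你 has the same reading in mainland China and Taiwan, so pron_tw is nil.
IO.inspect(ni.pron_tw, label: "你 pron_tw")

# 万 is one of the few characters with a separate Taiwanese reading.
IO.inspect(wan.pron, label: "万 pron")
IO.inspect(wan.pron_tw, label: "万 pron_tw")

# 好 has more than one reading, all of which are listed in alt.
IO.inspect(hao.alt, label: "好 alt")
```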

Functions

all_pronunciations(hanzi, left \\ "[ ", mid \\ " | ", right \\ " ]")

@spec all_pronunciations(t(), String.t(), String.t(), String.t()) ::
  Pinyin.pinyin_list()

Converter that returns all pronunciations of a character or the most common.

If only a single pronunciation is available, it is returned on its own. Otherwise, all possible pronunciations are returned, and left, mid and right determine how the alternatives are delimited: left is placed before the first pronunciation, right after the last pronunciation, and mid between each pair of adjacent pronunciations.

Examples

iex> Hanzi.all_pronunciations(~h/你/s)
[%Pinyin{initial: "n", final: "i", tone: 3}]

iex> Hanzi.all_pronunciations(~h/㓎/s)
["[ ", %Pinyin{initial: "q", final: "in", tone: 1}, " | ", %Pinyin{initial: "q", final: "in", tone: 4}, " | ", %Pinyin{initial: "q", final: "in", tone: 3}, " ]"]

iex> Hanzi.all_pronunciations(~h/㓎/s, "", "", "")
["", %Pinyin{initial: "q", final: "in", tone: 1}, "", %Pinyin{initial: "q", final: "in", tone: 4}, "", %Pinyin{initial: "q", final: "in", tone: 3}, ""]

character?/1

@spec character?(String.t()) :: boolean()

Check if a single character is a valid Han character.

Examples

iex> Hanzi.character?("你")
true

iex> Hanzi.character?("x")
false

iex> Hanzi.character?("你好")
false

characters(lst, joiner \\ "")

@spec characters(hanzi_list(), String.t()) :: String.t()

Convert a Hanzi list to a string of characters.

This function extracts the character of each Hanzi.t/0 in lst. Normal strings in the list are not modified. After the Hanzi in the list are converted to characters, the list is joined with Enum.join/2, with joiner passed as its joiner argument.

Examples

iex> characters(~h/你好/)
"你好"

iex> characters(~h/你hello/)
"你hello"

iex> characters(~h/你好/, ";")
"你;好"

iex> characters(~h/你hello/, ";")
"你;hello"

characters?/1

@spec characters?(String.t() | [String.t()]) :: boolean()

Verify if a list or string contains only Han characters.

Note that whitespace is not counted as a character.

Examples

iex> Hanzi.characters?(["你", "好"])
true

iex> Hanzi.characters?(["你", "boo", "好"])
false

iex> Hanzi.characters?("你好")
true

iex> Hanzi.characters?("你 好")
false

common_pronunciation(hanzi, loc \\ :cn)

@spec common_pronunciation(t(), :cn | :tw) :: Pinyin.pinyin_list()

Converter that retrieves the most common pronunciation of a Hanzi.t/0.

The additional argument specifies if the most common pronunciation for mainland China or Taiwan is retrieved.

Examples

iex> Hanzi.common_pronunciation(~h/你/s)
[%Pinyin{initial: "n", final: "i", tone: 3}]

iex> Hanzi.common_pronunciation(~h/你/s, :cn)
[%Pinyin{initial: "n", final: "i", tone: 3}]

iex> Hanzi.common_pronunciation(~h/你/s, :tw)
[%Pinyin{initial: "n", final: "i", tone: 3}]

iex> Hanzi.common_pronunciation(~h/万/s)
[%Pinyin{initial: "", final: "wan", tone: 4}]

iex> Hanzi.common_pronunciation(~h/万/s, :cn)
[%Pinyin{initial: "", final: "wan", tone: 4}]

iex> Hanzi.common_pronunciation(~h/万/s, :tw)
[%Pinyin{initial: "m", final: "o", tone: 4}]

from_character(character)

@spec from_character(String.t()) :: t() | nil

Obtain the Hanzi.t/0 struct for a character.

Note that this only works on a single character.

Examples

iex> Hanzi.from_character("你")
%Hanzi{char: "你", pron: %Pinyin{initial: "n", final: "i", tone: 3}, pron_tw: nil, alt: []}

iex> Hanzi.from_character("x")
nil

iex> Hanzi.from_character("你好")
nil

list_pronunciations(hanzi)

@spec list_pronunciations(t()) :: Pinyin.pinyin_list()

Converter that returns a list of all known pronunciations of a character.

This converter is similar to all_pronunciations/4, but it does not include separators around the various pronunciations.

Examples

iex> Hanzi.list_pronunciations(~h/你/s)
[%Pinyin{initial: "n", final: "i", tone: 3}]

iex> Hanzi.list_pronunciations(~h/㓎/s)
[%Pinyin{initial: "q", final: "in", tone: 1}, %Pinyin{initial: "q", final: "in", tone: 4}, %Pinyin{initial: "q", final: "in", tone: 3}]

read/1

@spec read(String.t()) :: hanzi_list()

Read a string and convert it into a list of strings and Hanzi.t/0 structs.

This function reads a string containing Han characters mixed with normal text. The output of this function is a list of strings and Hanzi structs.

The input string may contain any character. Any character in the string that is recognised as a Han character (by character?/1) is returned as a Hanzi.t/0 in the returned list. Any other character in the input is returned unmodified.

Examples

iex> Hanzi.read("你好")
[%Hanzi{char: "你", pron: %Pinyin{initial: "n", final: "i", tone: 3}}, %Hanzi{char: "好", pron: %Pinyin{initial: "h", final: "ao", tone: 3}, alt: [%Pinyin{initial: "h", final: "ao", tone: 3}, %Pinyin{initial: "h", final: "ao", tone: 4}]}]

iex> Hanzi.read("hello, 你")
["hello, ", %Hanzi{char: "你", pron: %Pinyin{initial: "n", final: "i", tone: 3}}]
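Because the list returned by read/1 keeps non-Han text as plain strings, it can be filtered or turned back into a string with characters/2. The sketch below illustrates both; it assumes the package can be fetched from Hex as `:hanyutils`.

```elixir
# Assumption: the package is published on Hex as :hanyutils; the version
# requirement simply mirrors the version this page documents.
Mix.install([{:hanyutils, "~> 0.3.0"}])

mixed = Hanzi.read("hello, 你")

# Keep only the Hanzi structs by dropping the plain strings.
only_hanzi = Enum.reject(mixed, &is_binary/1)
IO.inspect(Hanzi.characters(only_hanzi))

# characters/2 with the default joiner gives back the original text.
round_trip = Hanzi.characters(mixed)
IO.inspect(round_trip)
```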

sigil_h(arg, list)

(macro)

Sigil to create a Hanzi list or struct.

When used without any modifiers, this sigil converts its input into a hanzi list through the use of read/1. When the s modifier is used, from_character/1 is used instead.

Examples

iex> ~h/hello, 你/
["hello, ", %Hanzi{char: "你", pron: %Pinyin{initial: "n", final: "i", tone: 3}}]

iex> ~h/你/s
%Hanzi{char: "你", pron: %Pinyin{initial: "n", final: "i", tone: 3}}

iex> ~h/你好/s
nil

to_pinyin(lst, converter \\ &common_pronunciation/1)

@spec to_pinyin(hanzi_list(), (t() -> Pinyin.pinyin_list())) :: Pinyin.pinyin_list()

Convert a Hanzi list to a Pinyin list.

Normal strings in the Hanzi list are returned unmodified. Every Hanzi.t/0 is passed to converter, which returns a Pinyin.pinyin_list/0. The contents of that list are appended to the result.

After calling this function, Pinyin.marked/1 or Pinyin.numbered/1 can be used to format the result.

Converters

A converter is any function that transforms a Hanzi.t/0 into a Pinyin.pinyin_list/0. In the simplest case, a converter returns only the most common pronunciation. In more complicated cases, a converter returns all the possible pronunciations of a Hanzi, separated by strings.

The Hanzi module includes three converters: common_pronunciation/2, all_pronunciations/4 and list_pronunciations/1. If no converter is specified, &common_pronunciation(&1, :cn) is used.

If you wish to write your own converter, the functions mentioned above and the examples below should be a good starting point.

Examples

iex> to_pinyin(~h/你好/)
[%Pinyin{initial: "n", final: "i", tone: 3}, %Pinyin{initial: "h", final: "ao", tone: 3}]

iex> to_pinyin(~h/二万/, &common_pronunciation(&1, :tw))
[%Pinyin{initial: "", final: "er", tone: 4}, %Pinyin{initial: "m", final: "o", tone: 4}]

iex> to_pinyin(~h/你好/, &all_pronunciations/1)
[%Pinyin{initial: "n", final: "i", tone: 3}, "[ ", %Pinyin{initial: "h", final: "ao", tone: 3}, " | ", %Pinyin{initial: "h", final: "ao", tone: 4}, " ]"]

iex> to_pinyin(~h/你好/, &all_pronunciations(&1, "", "", ""))
[%Pinyin{initial: "n", final: "i", tone: 3}, "", %Pinyin{initial: "h", final: "ao", tone: 3}, "", %Pinyin{initial: "h", final: "ao", tone: 4}, ""]

iex> to_pinyin(~h/你好/, fn %Hanzi{pron: p} -> [p] end)
[%Pinyin{initial: "n", final: "i", tone: 3}, %Pinyin{initial: "h", final: "ao", tone: 3}]
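As a further starting point for a custom converter, the sketch below shows the mainland reading followed by the Taiwanese reading in parentheses when the two differ. The module name `ConverterSketch` and the function `both_regions/1` are hypothetical and not part of the library; the script also assumes the package can be fetched from Hex as `:hanyutils`.

```elixir
# Assumption: the package is published on Hex as :hanyutils; the version
# requirement simply mirrors the version this page documents.
Mix.install([{:hanyutils, "~> 0.3.0"}])

# Hypothetical module, not part of hanyutils.
defmodule ConverterSketch do
  # No separate Taiwanese reading: return the mainland reading only.
  def both_regions(%{pron: pron, pron_tw: nil}), do: [pron]

  # Readings differ: mainland reading, then the Taiwanese one in parentheses.
  def both_regions(%{pron: pron, pron_tw: pron_tw}), do: [pron, " (", pron_tw, ")"]
end

"万"
|> Hanzi.read()
|> Hanzi.to_pinyin(&ConverterSketch.both_regions/1)
|> Pinyin.numbered()
|> IO.puts()
```

The clauses match on plain map keys rather than on `%Hanzi{}`, which keeps the module compilable even before the dependency is loaded; inside a regular Mix project, matching on the struct would be the more explicit choice.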