Lingua (lingua v0.3.6)

Lingua wraps Peter M. Stahl's lingua-rs language detection library. The wrapper follows the lingua-rs API closely, so consult the lingua-rs documentation for details not covered here.

Types

builder_option()

@type builder_option() ::
  :all_languages
  | :all_spoken_languages
  | :all_languages_with_arabic_script
  | :all_languages_with_cyrillic_script
  | :all_languages_with_devanagari_script
  | :all_languages_with_latin_script
  | :with_languages
  | :without_languages

Builder option for configuring language detection

confidence_value()

@type confidence_value() :: {language(), float()}

Confidence value tuple with language and confidence score

detect_option()

@type detect_option() ::
  {:builder_option, builder_option()}
  | {:languages, [language_identifier()]}
  | {:minimum_relative_distance, float()}
  | {:compute_language_confidence_values, boolean()}
  | {:preload_language_models, boolean()}
  | {:low_accuracy_mode, boolean()}

Detection options

detect_result()

@type detect_result() ::
  {:ok, language() | :no_match | [confidence_value()]}
  | {:error, :insufficient_languages | :out_of_range_minimum_relative_distance}

Detection result

iso_639_1()

@type iso_639_1() :: atom()

ISO 639-1 two-letter language code (e.g., :en, :es, :he)

iso_639_3()

@type iso_639_3() :: atom()

ISO 639-3 three-letter language code (e.g., :eng, :spa, :heb)

language()

@type language() :: atom()

A supported language atom (e.g., :english, :spanish, :hebrew)

language_identifier()

@type language_identifier() :: language() | iso_639_1() | iso_639_3()

Language identifier - can be a language name, ISO 639-1, or ISO 639-3 code

Functions

all_languages()

@spec all_languages() :: [language()]

Get the list of supported languages.

Example

iex> Lingua.all_languages()
[:afrikaans, :albanian, :arabic, :armenian, :azerbaijani, :basque, :belarusian,
 :bengali, :bokmal, :bosnian, :bulgarian, :catalan, :chinese, :croatian, :czech,
 :danish, :dutch, :english, :esperanto, :estonian, :finnish, :french, :ganda,
 :georgian, :german, :greek, :gujarati, :hebrew, :hindi, :hungarian, :icelandic,
 :indonesian, :irish, :italian, :japanese, :kazakh, :korean, :latin, :latvian,
 :lithuanian, :macedonian, :malay, :maori, :marathi, :mongolian, :nynorsk,
 :persian, :polish, :portuguese, :punjabi, :romanian, :russian, :serbian,
 :shona, :slovak, :slovene, :somali, :sotho, :spanish, :swahili, :swedish,
 :tagalog, :tamil, :telugu, :thai, :tsonga, :tswana, :turkish, :ukrainian,
 :urdu, :vietnamese, :welsh, :xhosa, :yoruba, :zulu]

all_languages_with_arabic_script()

@spec all_languages_with_arabic_script() :: [language()]

Get the list of supported languages using Arabic script.

Example

iex> Lingua.all_languages_with_arabic_script()
[:arabic, :persian, :urdu]

all_languages_with_cyrillic_script()

@spec all_languages_with_cyrillic_script() :: [language()]

Get the list of supported languages using Cyrillic script.

Example

iex> Lingua.all_languages_with_cyrillic_script()
[:belarusian, :bulgarian, :kazakh, :macedonian, :mongolian, :russian, :serbian, :ukrainian]

all_languages_with_devanagari_script()

@spec all_languages_with_devanagari_script() :: [language()]

Get the list of supported languages using Devanagari script.

Example

iex> Lingua.all_languages_with_devanagari_script()
[:hindi, :marathi]

all_languages_with_latin_script()

@spec all_languages_with_latin_script() :: [language()]

Get the list of supported languages using Latin script.

Example

iex> Lingua.all_languages_with_latin_script()
[:afrikaans, :albanian, :azerbaijani, :basque, :bokmal, :bosnian, :catalan,
 :croatian, :czech, :danish, :dutch, :english, :esperanto, :estonian, :finnish,
 :french, :ganda, :german, :hungarian, :icelandic, :indonesian, :irish,
 :italian, :latin, :latvian, :lithuanian, :malay, :maori, :nynorsk, :polish,
 :portuguese, :romanian, :shona, :slovak, :slovene, :somali, :sotho, :spanish,
 :swahili, :swedish, :tagalog, :tsonga, :tswana, :turkish, :vietnamese, :welsh,
 :xhosa, :yoruba, :zulu]

all_spoken_languages()

@spec all_spoken_languages() :: [language()]

Get the list of supported spoken languages.

Example

iex> Lingua.all_spoken_languages()
[:afrikaans, :albanian, :arabic, :armenian, :azerbaijani, :basque, :belarusian,
 :bengali, :bokmal, :bosnian, :bulgarian, :catalan, :chinese, :croatian, :czech,
 :danish, :dutch, :english, :esperanto, :estonian, :finnish, :french, :ganda,
 :georgian, :german, :greek, :gujarati, :hebrew, :hindi, :hungarian, :icelandic,
 :indonesian, :irish, :italian, :japanese, :kazakh, :korean, :latvian,
 :lithuanian, :macedonian, :malay, :maori, :marathi, :mongolian, :nynorsk,
 :persian, :polish, :portuguese, :punjabi, :romanian, :russian, :serbian,
 :shona, :slovak, :slovene, :somali, :sotho, :spanish, :swahili, :swedish,
 :tagalog, :tamil, :telugu, :thai, :tsonga, :tswana, :turkish, :ukrainian,
 :urdu, :vietnamese, :welsh, :xhosa, :yoruba, :zulu]

detect(text, options \\ [])

@spec detect(String.t(), [detect_option()]) :: detect_result()

Detect the language of the given input text. By default, all supported languages will be considered and the minimum relative distance is 0.0.

Returns the detected language, or a list of languages and their confidence values, or :no_match if the given text doesn't match a language.

Options

  • builder_option: - can be one of the following (defaults to :all_languages):

    • :all_languages - consider every supported language
    • :all_spoken_languages - consider only currently spoken languages
    • :all_languages_with_arabic_script - consider only languages written in Arabic script
    • :all_languages_with_cyrillic_script - consider only languages written in Cyrillic script
    • :all_languages_with_devanagari_script - consider only languages written in Devanagari script
    • :all_languages_with_latin_script - consider only languages written in Latin script
    • :with_languages - consider only the languages supplied in the languages option
    • :without_languages - consider all languages except those supplied in the languages option
  • languages: - specify two or more languages to consider or exclude, depending on builder_option: (defaults to []). Accepts any combination of language names, ISO 639-1 codes (2-letter), or ISO 639-3 codes (3-letter). For example, [:english, :ru, :heb] mixes a language name with an ISO 639-1 code and an ISO 639-3 code.

  • minimum_relative_distance: - specify the minimum relative distance (0.0 - 0.99) required for a language to be considered a match for the input. See the lingua-rs documentation for details. (defaults to 0.0)

  • compute_language_confidence_values: - return the full list of language matches for the input and their confidence values instead of a single language. (defaults to false)

  • preload_language_models: - preload all language models instead of just those required for the match. (defaults to false)

  • low_accuracy_mode: - use low accuracy mode for faster detection at the cost of accuracy. (defaults to false)

Return Values

  • {:ok, language} - the detected language (e.g., {:ok, :english})
  • {:ok, :no_match} - no language matched the input
  • {:ok, [confidence_values]} - list of {language, confidence} tuples when compute_language_confidence_values: true
  • {:error, :insufficient_languages} - fewer than 2 languages provided with :with_languages or :without_languages
  • {:error, :out_of_range_minimum_relative_distance} - minimum_relative_distance not in range 0.0..0.99
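
The return shapes listed above can be handled with one set of pattern-matching clauses. The following is a minimal sketch, not part of Lingua: the module DetectResultExample and its describe/1 function are hypothetical names introduced only for illustration.

```elixir
defmodule DetectResultExample do
  # Hypothetical helper (not part of Lingua): turns any documented
  # detect/2 result into a short human-readable label.

  # :no_match is also an atom, so it must be matched before the generic clause.
  def describe({:ok, :no_match}), do: "no match"

  # A single detected language, e.g. {:ok, :english}.
  def describe({:ok, language}) when is_atom(language), do: "detected #{language}"

  # The list form returned with compute_language_confidence_values: true.
  def describe({:ok, confidence_values}) when is_list(confidence_values) do
    Enum.map_join(confidence_values, ", ", fn {language, confidence} ->
      "#{language}=#{confidence}"
    end)
  end

  # :insufficient_languages or :out_of_range_minimum_relative_distance.
  def describe({:error, reason}), do: "error: #{reason}"
end
```

A result can then be piped straight in, for example text |> Lingua.detect() |> DetectResultExample.describe().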

Examples

iex> Lingua.detect("this is definitely English")
{:ok, :english}

iex> Lingua.detect("וזה בעברית")
{:ok, :hebrew}

iex> Lingua.detect("państwowych", builder_option: :with_languages, languages: [:english, :russian, :polish])
{:ok, :polish}

iex> Lingua.detect("ѕидови", builder_option: :all_languages_with_cyrillic_script)
{:ok, :macedonian}

iex> Lingua.detect("כלב", builder_option: :with_languages, languages: [:english, :russian, :polish])
{:ok, :no_match}

iex> Lingua.detect("what in the world is this", builder_option: :with_languages, languages: [:english, :russian, :hebrew], compute_language_confidence_values: true)
{:ok, [{:english, 1.0}, {:hebrew, 0.0}, {:russian, 0.0}]}

Using ISO codes in the languages list:

iex> Lingua.detect("hello world", builder_option: :with_languages, languages: [:en, :de])
{:ok, :english}

iex> Lingua.detect("hello world", builder_option: :with_languages, languages: [:eng, :deu])
{:ok, :english}

detect!(text, options \\ [])

@spec detect!(String.t(), [detect_option()]) ::
  language() | :no_match | [confidence_value()]

Like detect/2, but returns the bare result value instead of an {:ok, value} tuple, and raises an error instead of returning an {:error, reason} tuple.

Examples

iex> Lingua.detect!("this is definitely English")
:english

iex> Lingua.detect!("hello", builder_option: :with_languages, languages: [:english, :german])
:english
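
Because detect!/2 raises instead of returning an error tuple, callers that want a fallback value need a rescue. The sketch below is one possible way to do that. DetectOrDefault and language_or/4 are hypothetical names, the detector argument exists only so the helper can be exercised with a stub, and since the exception type raised by detect!/2 is not stated here, the sketch rescues any exception.

```elixir
defmodule DetectOrDefault do
  # Hypothetical helper (not part of Lingua). `detector` defaults to
  # Lingua.detect!/2 and can be swapped for a stub function.
  def language_or(text, default, options \\ [], detector \\ &Lingua.detect!/2) do
    case detector.(text, options) do
      :no_match ->
        default

      language when is_atom(language) ->
        language

      [] ->
        default

      # List form (compute_language_confidence_values: true):
      # pick the language with the highest confidence.
      values when is_list(values) ->
        {language, _confidence} = Enum.max_by(values, fn {_lang, conf} -> conf end)
        language
    end
  rescue
    # Assumption: the exception type is unspecified, so rescue everything.
    _error -> default
  end
end
```

With the default detector, DetectOrDefault.language_or("hello", :unknown) would return the detected language, or :unknown when nothing matches or detection raises.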

init()

@spec init() :: :ok

Initialize the detector. Calling this is optional, but it is useful when you want lingua-rs to load the language corpora up front so that subsequent calls to detect/2 are fast. The first run of the detector can take some time to load (~12 seconds on my MacBook Pro). See also the preload_language_models option of detect/2.

Example

iex> Lingua.init()
:ok
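
To see what the warm-up costs on a given machine, the call can be wrapped in :timer.tc/1. This is a minimal sketch under stated assumptions: LinguaWarmup and run/1 are hypothetical names, and the init_fun argument exists only so the helper can be tried with a stub instead of the real Lingua.init/0.

```elixir
defmodule LinguaWarmup do
  # Hypothetical helper (not part of Lingua). `init_fun` defaults to
  # Lingua.init/0; any zero-arity function can be passed instead.
  def run(init_fun \\ &Lingua.init/0) do
    # :timer.tc/1 returns {elapsed_microseconds, return_value_of_fun}.
    {microseconds, result} = :timer.tc(init_fun)

    # Report the result together with the elapsed time in milliseconds.
    {result, div(microseconds, 1000)}
  end
end
```

Calling LinguaWarmup.run() once during application start-up, for instance from a supervised task, is one way to keep the load time out of the first real detection request.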

iso_code_639_1_for_language(language)

@spec iso_code_639_1_for_language(language()) ::
  {:ok, iso_639_1()} | {:error, :unrecognized_language}

Get the ISO 639-1 language code for the given language.

Example

iex> Lingua.iso_code_639_1_for_language(:english)
{:ok, :en}
iex> Lingua.iso_code_639_1_for_language(:nope)
{:error, :unrecognized_language}

iso_code_639_3_for_language(language)

@spec iso_code_639_3_for_language(language()) ::
  {:ok, iso_639_3()} | {:error, :unrecognized_language}

Get the ISO 639-3 language code for the given language.

Example

iex> Lingua.iso_code_639_3_for_language(:english)
{:ok, :eng}
iex> Lingua.iso_code_639_3_for_language(:nope)
{:error, :unrecognized_language}

language_for_iso_code(code)

@spec language_for_iso_code(iso_639_1() | iso_639_3()) ::
  {:ok, language()} | {:error, :unrecognized_iso_code}

Get the language for the given ISO 639-1 or 639-3 language code.

Example

iex> Lingua.language_for_iso_code(:en)
{:ok, :english}
iex> Lingua.language_for_iso_code(:eng)
{:ok, :english}
iex> Lingua.language_for_iso_code(:mop)
{:error, :unrecognized_iso_code}

language_for_iso_code_639_1(code)

@spec language_for_iso_code_639_1(iso_639_1()) ::
  {:ok, language()} | {:error, :unrecognized_iso_code}

Get the language for the given ISO 639-1 language code.

Example

iex> Lingua.language_for_iso_code_639_1(:en)
{:ok, :english}
iex> Lingua.language_for_iso_code_639_1(:er)
{:error, :unrecognized_iso_code}

language_for_iso_code_639_3(code)

@spec language_for_iso_code_639_3(iso_639_3()) ::
  {:ok, language()} | {:error, :unrecognized_iso_code}

Get the language for the given ISO 639-3 language code.

Example

iex> Lingua.language_for_iso_code_639_3(:eng)
{:ok, :english}
iex> Lingua.language_for_iso_code_639_3(:enr)
{:error, :unrecognized_iso_code}
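
The lookup functions above can be combined to normalize any language_identifier() into a language name. The following is a sketch rather than library functionality: LanguageNormalizer, normalize/2, and the :unrecognized_identifier error atom are hypothetical, and the lookup argument (defaulting to Lingua) exists only so a stub module can stand in for the library.

```elixir
defmodule LanguageNormalizer do
  # Hypothetical helper (not part of Lingua). `lookup` is any module that
  # exposes language_for_iso_code/1 and iso_code_639_1_for_language/1.
  def normalize(identifier, lookup \\ Lingua) when is_atom(identifier) do
    case lookup.language_for_iso_code(identifier) do
      # The identifier was an ISO 639-1 or 639-3 code.
      {:ok, language} ->
        {:ok, language}

      {:error, :unrecognized_iso_code} ->
        # Not an ISO code: accept it only if it is a known language name.
        case lookup.iso_code_639_1_for_language(identifier) do
          {:ok, _code} -> {:ok, identifier}
          {:error, :unrecognized_language} -> {:error, :unrecognized_identifier}
        end
    end
  end
end
```

With the default lookup, LanguageNormalizer.normalize(:en), normalize(:eng), and normalize(:english) would all be expected to resolve to the same language name.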