Text.Language (Text v0.2.0)

A module to support natural language detection.

The primary models are implementations derived from the paper "Language Identification from Text Using N-gram Based Cumulative Frequency Addition".

Summary

Functions

Classify the natural language of a given text.

Detect the natural language of a given text.

Returns a list of the known classifiers that can be applied as a :classifier option to Text.Language.detect/2.

Removes text elements that interfere with language detection.

Functions

classify(text, options \\ [])

Specs

Classify the natural language of a given text.

Arguments

  • text is a binary text from which the language is detected.

  • options is a keyword list of options.

Options

  • :corpus is a module encapsulating a body of text in one or more natural languages. A corpus module implements the Text.Corpus behaviour. The default is Text.Corpus.Udhr, which is implemented by the text_corpus_udhr package. That package must be installed as a dependency for this default to be used.

  • :classifier is the module used to detect the language. The default is Text.Language.Classifier.NaiveBayesian. Other classifiers are Text.Language.Classifier.RankOrder, Text.Classifier.CummulativeFrequency and Text.Language.Classifier.Spearman. Any module that implements the Text.Language.Classifier behaviour may be used.

  • :vocabulary is the vocabulary to be used. The default is hd(corpus.known_vocabularies()). Available vocabularies are returned from corpus.known_vocabularies/0.

  • :only is a list of languages to be used as candidates for the language of text. The default is the result of corpus.known_languages/0, which is all the languages known to the given corpus.

  • :max_demand is used to determine the batch size for Flow.from_enumerable/1. The default is 20.

Returns

  • A list of 2-tuples in order of confidence with the first element being the BCP-47 language code and the second element being the score as determined by the requested classifier. The score has no meaning except to order the results by confidence level.
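As a minimal sketch of consuming this return value, the helper below keeps only the language codes of the most likely candidates. The module CandidateSummary and the function top_languages/2 are hypothetical names and not part of the Text library. Passing language codes as strings to :only is an assumption, and the scores in the sample data are made up purely to show the shape of the list.

```elixir
defmodule CandidateSummary do
  # Hypothetical helper, not part of the Text library.
  # Expects the list of {language, score} tuples described above,
  # already ordered from most to least likely.
  def top_languages(results, count \\ 3) do
    results
    |> Enum.take(count)
    |> Enum.map(fn {language, _score} -> language end)
  end
end

# With the text and text_corpus_udhr packages installed, the input would
# come from a call such as (string language codes are an assumption):
#
#   "Guten Morgen, wie geht es dir?"
#   |> Text.Language.classify(only: ["de", "en", "nl"])
#   |> CandidateSummary.top_languages(2)

# Stand-in data with made-up scores, used only to illustrate the shape.
sample = [{"de", -10.2}, {"nl", -11.7}, {"en", -12.9}]
IO.inspect(CandidateSummary.top_languages(sample, 2))
```

Because the score only orders the results, the helper discards it instead of comparing it against a threshold.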

detect(text, options \\ [])

Specs

detect(String.t(), Keyword.t()) ::
  {:ok, Text.language()} | {:error, {module(), String.t()}}

Detect the natural language of a given text.

Arguments

  • text is a binary text from which the language is detected.

  • options is a keyword list of options.

Options

  • :corpus is a module encapsulating a body of text in one or more natural languages. A corpus module implements the Text.Corpus behaviour. The default is Text.Corpus.Udhr, which is implemented by the text_corpus_udhr package. That package must be installed as a dependency for this default to be used.

  • :classifier is the module used to detect the language. The default is Text.Language.Classifier.NaiveBayesian. Other classifiers are Text.Language.Classifier.RankOrder, Text.Classifier.CummulativeFrequency and Text.Language.Classifier.Spearman. Any module that implements the Text.Language.Classifier behaviour may be used.

  • :vocabulary is the vocabulary to be used. The default is hd(corpus.known_vocabularies()). Available vocabularies are returned from corpus.known_vocabularies/0.

  • :only is a list of languages to be used as candidates for the language of text. The default is the result of corpus.known_languages/0, which is all the languages known to the given corpus.

  • :max_demand is used to determine the batch size for Flow.from_enumerable/1. The default is 20.

Returns

  • {:ok, language} where language is the BCP-47 language code judged most likely by the requested classifier, or

  • {:error, {module, reason}} if the language cannot be detected.
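The spec for detect/2 above gives two result shapes, {:ok, language} and {:error, {module, reason}}. A minimal sketch of handling both by pattern matching follows. DetectionReport and describe/1 are hypothetical names and not part of the Text library, and the error tuple in the sample calls is a stand-in, not an error the library is known to return.

```elixir
defmodule DetectionReport do
  # Hypothetical helper, not part of the Text library. It pattern matches
  # on the two result shapes given in the detect/2 spec.
  def describe({:ok, language}) do
    "Detected language: #{language}"
  end

  def describe({:error, {module, reason}}) do
    "Detection failed (#{inspect(module)}): #{reason}"
  end
end

# With the text and text_corpus_udhr packages installed, usage would be:
#
#   "Bonjour tout le monde"
#   |> Text.Language.detect()
#   |> DetectionReport.describe()

# Stand-in results, used only to exercise both clauses.
IO.puts(DetectionReport.describe({:ok, "fr"}))
IO.puts(DetectionReport.describe({:error, {ArgumentError, "stand-in reason"}}))
```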

known_classifiers()

Specs

known_classifiers() :: [Text.Language.Classifier.t(), ...]

Returns a list of the known classifiers that can be applied as a :classifier option to Text.Language.detect/2.

normalize_text(text)

Specs

normalize_text(String.t()) :: String.t()

Removes text elements that interfere with language detection.

Each corpus has a callback normalize_text/1 that is applied when training the classifier and when detecting language from natural text. If desired, the corpus can delegate to this function.

Argument

  • text is any String.t()

Returns

  • A normalized String.t()
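The delegation mentioned above could be sketched with defdelegate, as below. MyApp.Corpus.Example is a hypothetical module name. The other callbacks a Text.Corpus implementation needs are omitted here, so this fragment is not a complete corpus, and calling it requires the text package to be installed.

```elixir
defmodule MyApp.Corpus.Example do
  # Hypothetical corpus module. Only the normalize_text/1 callback is
  # shown; the remaining Text.Corpus callbacks are intentionally omitted.

  # Forward normalization to the shared implementation in Text.Language.
  defdelegate normalize_text(text), to: Text.Language
end
```

A corpus that needs different cleaning rules would define its own normalize_text/1 instead of delegating.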