Text.Language (Text v0.2.0)
A module to support natural language detection.
The primary models are implementations derived from the paper "Language Identification from Text Using N-gram Based Cumulative Frequency Addition".
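The idea behind cumulative frequency addition can be illustrated with a small, self-contained sketch. This is not the library's implementation: the module name CfaSketch, the trigram size and the two tiny training strings are all assumptions made for illustration. Each language gets a profile of relative n-gram frequencies, and a text is scored by adding up the profile frequencies of its own n-grams.

```elixir
defmodule CfaSketch do
  # Hypothetical illustration module; not part of the Text package.

  # Split text into overlapping character n-grams (trigrams by default).
  def ngrams(text, n \\ 3) do
    text
    |> String.downcase()
    |> String.graphemes()
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join/1)
  end

  # Build a profile mapping each n-gram to its relative frequency
  # within the training text.
  def profile(text, n \\ 3) do
    grams = ngrams(text, n)
    total = length(grams)

    grams
    |> Enum.frequencies()
    |> Map.new(fn {gram, count} -> {gram, count / total} end)
  end

  # Score the text against every profile by summing the profile
  # frequency of each n-gram in the text (unknown n-grams add 0.0),
  # then order the candidates from highest to lowest score.
  def classify(text, profiles, n \\ 3) do
    grams = ngrams(text, n)

    profiles
    |> Enum.map(fn {language, profile} ->
      score = grams |> Enum.map(&Map.get(profile, &1, 0.0)) |> Enum.sum()
      {language, score}
    end)
    |> Enum.sort_by(fn {_language, score} -> score end, :desc)
  end
end

# Toy training data, far too small for real use.
profiles = %{
  "en" => CfaSketch.profile("the quick brown fox jumps over the lazy dog and the cat"),
  "de" => CfaSketch.profile("der schnelle braune fuchs springt über den faulen hund und die katze")
}

[{best, _score} | _rest] = CfaSketch.classify("the dog and the fox", profiles)
```

A real corpus would supply much larger profiles, and the published method may normalise frequencies differently from this sketch.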
Summary
Functions
Classify the natural language of a given text.
Detect the natural language of a given text.
Returns a list of the known classifiers that can be applied as a :classifier option to Text.Language.detect/2.
Function to remove text elements that interfere with language detection.
Functions
Specs
classify(String.t(), Keyword.t()) :: Text.Language.Classifier.frequency_list() | {:error, {module(), String.t()}}
Classify the natural language of a given text.
Arguments
- text is a binary text from which the language is detected.
- options is a keyword list of options.
Options
- :corpus is a module encapsulating a body of text in one or more natural languages. A corpus module implements the Text.Corpus behaviour. The default is Text.Corpus.Udhr, which is implemented by the text_corpus_udhr package. This package must be installed as a dependency in order for this default to be used.
- :classifier is the module used to detect the language. The default is Text.Language.Classifier.NaiveBayesian. Other classifiers are Text.Language.Classifier.RankOrder, Text.Classifier.CummulativeFrequency and Text.Language.Classifier.Spearman. Any module that implements the Text.Language.Classifier behaviour may be used.
- :vocabulary is the vocabulary to be used. The default is hd(corpus.known_vocabularies()). Available vocabularies are returned from corpus.known_vocabularies/0.
- :only is a list of languages to be used as candidates for the language of text. The default is corpus.known_languages/0, which is all the languages known to a given corpus.
- :max_demand is used to determine the batch size for Flow.from_enumerable/1. The default is 20.
Returns
- A list of 2-tuples in order of confidence, with the first element being the BCP-47 language code and the second element being the score as determined by the requested classifier. The score has no meaning except to order the results by confidence level.
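As a minimal sketch of consuming this result, the helper below keeps the first few language codes from the ordered list and passes an error through. The module ClassifyExample and its functions are hypothetical names, not part of the package. classify_top/2 calls Text.Language.classify/2 and so assumes the text package and a corpus such as text_corpus_udhr are installed.

```elixir
defmodule ClassifyExample do
  # Hypothetical helper module; not part of the Text package.

  # Keep the first `n` language codes from a classify result.
  # The list is already ordered by confidence, so it is not re-sorted.
  def top_languages(result, n \\ 3)

  # Error shape taken from the spec: {:error, {module, message}}.
  def top_languages({:error, {_module, message}}, _n), do: {:error, message}

  def top_languages(candidates, n) when is_list(candidates) do
    languages =
      candidates
      |> Enum.take(n)
      |> Enum.map(fn {language, _score} -> language end)

    {:ok, languages}
  end

  # Requires the text package at runtime; the classifier shown is one
  # of those listed in the options above.
  def classify_top(text, n \\ 3) do
    text
    |> Text.Language.classify(classifier: Text.Language.Classifier.RankOrder)
    |> top_languages(n)
  end
end
```

Because the score only orders the results, the helper discards it rather than comparing it against a threshold.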
Specs
detect(String.t(), Keyword.t()) :: {:ok, Text.language()} | {:error, {module(), String.t()}}
Detect the natural language of a given text.
Arguments
- text is a binary text from which the language is detected.
- options is a keyword list of options.
Options
- :corpus is a module encapsulating a body of text in one or more natural languages. A corpus module implements the Text.Corpus behaviour. The default is Text.Corpus.Udhr, which is implemented by the text_corpus_udhr package. This package must be installed as a dependency in order for this default to be used.
- :classifier is the module used to detect the language. The default is Text.Language.Classifier.NaiveBayesian. Other classifiers are Text.Language.Classifier.RankOrder, Text.Classifier.CummulativeFrequency and Text.Language.Classifier.Spearman. Any module that implements the Text.Language.Classifier behaviour may be used.
- :vocabulary is the vocabulary to be used. The default is hd(corpus.known_vocabularies()). Available vocabularies are returned from corpus.known_vocabularies/0.
- :only is a list of languages to be used as candidates for the language of text. The default is corpus.known_languages/0, which is all the languages known to a given corpus.
- :max_demand is used to determine the batch size for Flow.from_enumerable/1. The default is 20.
Returns
- {:ok, language} where language is the detected language, or
- {:error, {module, message}}
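The two result shapes in the spec for detect/2 can be handled with a pair of function clauses. The sketch below is illustrative only: DetectExample, describe/1 and detect_and_describe/2 are hypothetical names, and detect_and_describe/2 assumes the text package and its default corpus are installed.

```elixir
defmodule DetectExample do
  # Hypothetical helper module; not part of the Text package.

  # Turn either documented result shape into a log-friendly string.
  def describe({:ok, language}), do: "detected: #{inspect(language)}"
  def describe({:error, {module, message}}), do: "#{inspect(module)}: #{message}"

  # Requires the text package at runtime. Options are passed through
  # unchanged, for example [classifier: Text.Language.Classifier.Spearman].
  def detect_and_describe(text, options \\ []) do
    text
    |> Text.Language.detect(options)
    |> describe()
  end
end
```

Matching on the tagged tuples keeps the error module and message available to the caller instead of raising.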
Specs
known_classifiers() :: [Text.Language.Classifier.t(), ...]
Returns a list of the known classifiers that can be applied as a :classifier option to Text.Language.detect/2.
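One possible use of this list is to run the same text through every classifier and compare the answers. The sketch below assumes that approach. CompareClassifiers is a hypothetical module, and the detection function is passed in as an argument so the helper can be exercised without the package installed. compare_known/1 wires in the documented functions and needs the text package at runtime.

```elixir
defmodule CompareClassifiers do
  # Hypothetical helper module; not part of the Text package.

  # Call `detect_fun` once per classifier, passing it as the
  # :classifier option, and collect the results keyed by classifier.
  def compare(text, classifiers, detect_fun) do
    Map.new(classifiers, fn classifier ->
      {classifier, detect_fun.(text, classifier: classifier)}
    end)
  end

  # Requires the text package at runtime.
  def compare_known(text) do
    compare(text, Text.Language.known_classifiers(), &Text.Language.detect/2)
  end
end
```

Disagreement between classifiers on a short text can be a hint that the input is too small for a reliable answer.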
Specs
Function to remove text elements that interfere with language detection.
Each corpus has a callback normalize_text/1
that is applied when training the
classifier and when detecting language
from natural text. If desired, the corpus
can delegate to this function.
Argument
- text is any String.t
Returns
- A normalized String.t