Text.Language (Text v0.2.0)
A module to support natural language detection.
The primary models are implementations derived from the paper "Language Identification from Text Using N-gram Based Cumulative Frequency Addition".
Summary
Functions
Classify the natural language of a given text.
Detect the natural language of a given text.
Returns a list of the known classifiers that can be applied as a :classifier option to Text.Language.detect/2.
Function to remove text elements that interfere with language detection.
Functions
Specs
classify(String.t(), Keyword.t()) :: Text.Language.Classifier.frequency_list() | {:error, {module(), String.t()}}
Classify the natural language of a given text.
Arguments
- text is a binary text from which the language is detected.
- options is a keyword list of options.
Options
- :corpus is a module encapsulating a body of text in one or more natural languages. A corpus module implements the Text.Corpus behaviour. The default is Text.Corpus.Udhr, which is implemented by the text_corpus_udhr package. This package must be installed as a dependency in order for this default to be used.
- :classifier is the module used to detect the language. The default is Text.Language.Classifier.NaiveBayesian. Other classifiers are Text.Language.Classifier.RankOrder, Text.Classifier.CummulativeFrequency and Text.Language.Classifier.Spearman. Any module that implements the Text.Language.Classifier behaviour may be used.
- :vocabulary is the vocabulary to be used. The default is hd(corpus.known_vocabularies()). Available vocabularies are returned from corpus.known_vocabularies/0.
- :only is a list of languages to be used as candidates for the language of text. The default is corpus.known_languages/0, which is all the languages known to a given corpus.
- :max_demand is used to determine the batch size for Flow.from_enumerable/1. The default is 20.
Returns
- A list of 2-tuples in order of confidence, with the first element being the BCP-47 language code and the second element being the score as determined by the requested classifier. The score has no meaning except to order the results by confidence level.
- Or {:error, {module, message}}, as given in the spec above.
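Because the documented return value is either an ordered list of 2-tuples or an error tuple, callers usually need to handle both shapes. The following is a minimal sketch of one way to do that. ClassifyResultSketch and top_languages/2 are hypothetical names that are not part of the Text library, and the sample scores are invented purely for illustration (language codes are assumed to be strings).

```elixir
defmodule ClassifyResultSketch do
  # Hypothetical helper, not part of the Text library.
  # Accepts either shape documented for classify/2: a list of
  # {language, score} tuples ordered by confidence, or an error tuple.
  def top_languages(result, count \\ 3)

  # Pass errors through unchanged so the caller can report them.
  def top_languages({:error, {_module, _message}} = error, _count), do: error

  # The list is already ordered by confidence, so taking the first
  # `count` entries is enough. Scores are dropped because they only
  # serve to order the results.
  def top_languages(frequency_list, count) when is_list(frequency_list) do
    languages =
      frequency_list
      |> Enum.take(count)
      |> Enum.map(fn {language, _score} -> language end)

    {:ok, languages}
  end
end

# Invented sample data in the documented shape (assumption: string codes).
sample = [{"en", -10.2}, {"fr", -14.8}, {"de", -15.1}, {"es", -16.0}]
ClassifyResultSketch.top_languages(sample, 2)
```

The helper deliberately ignores the score values, since the documentation states they carry no meaning beyond ordering.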
Specs
detect(String.t(), Keyword.t()) :: {:ok, Text.language()} | {:error, {module(), String.t()}}
Detect the natural language of a given text.
Arguments
- text is a binary text from which the language is detected.
- options is a keyword list of options.
Options
- :corpus is a module encapsulating a body of text in one or more natural languages. A corpus module implements the Text.Corpus behaviour. The default is Text.Corpus.Udhr, which is implemented by the text_corpus_udhr package. This package must be installed as a dependency in order for this default to be used.
- :classifier is the module used to detect the language. The default is Text.Language.Classifier.NaiveBayesian. Other classifiers are Text.Language.Classifier.RankOrder, Text.Classifier.CummulativeFrequency and Text.Language.Classifier.Spearman. Any module that implements the Text.Language.Classifier behaviour may be used.
- :vocabulary is the vocabulary to be used. The default is hd(corpus.known_vocabularies()). Available vocabularies are returned from corpus.known_vocabularies/0.
- :only is a list of languages to be used as candidates for the language of text. The default is corpus.known_languages/0, which is all the languages known to a given corpus.
- :max_demand is used to determine the batch size for Flow.from_enumerable/1. The default is 20.
Returns
- {:ok, language} where language is the BCP-47 language code ranked first by the requested classifier, or
- {:error, {module, message}}, as given in the spec above.
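As a rough illustration of calling detect/2 and matching on the two result shapes in its spec, here is a small sketch. DetectSketch and language_of/2 are hypothetical names, not part of the Text library. The wrapper only assumes what this page states: that Text.Language.detect/2 takes a text and a keyword list. It checks at runtime whether the module is loaded so that the snippet still runs when the dependency is missing.

```elixir
defmodule DetectSketch do
  # Hypothetical wrapper, not part of the Text library.
  # Returns {:ok, language} or {:error, {module, message}}, mirroring
  # the spec documented for detect/2.
  def language_of(text, options \\ []) when is_binary(text) and is_list(options) do
    if Code.ensure_loaded?(Text.Language) do
      # apply/3 avoids a compile-time reference to Text.Language,
      # so this file compiles even when the dependency is absent.
      case apply(Text.Language, :detect, [text, options]) do
        {:ok, language} -> {:ok, language}
        {:error, {module, message}} -> {:error, {module, message}}
      end
    else
      {:error, {__MODULE__, "Text.Language is not available"}}
    end
  end
end

# Possible usage. The option names come from the Options list above;
# the string language codes passed to :only are an assumption.
#
#   DetectSketch.language_of("Tous les êtres humains naissent libres",
#     only: ["en", "fr"],
#     classifier: Text.Language.Classifier.RankOrder
#   )
```

Restricting the candidates with :only is likely to help most on short inputs, where fewer n-grams are available to separate similar languages.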
Specs
known_classifiers() :: [Text.Language.Classifier.t(), ...]
Returns a list of the known classifiers that can be applied as a :classifier option to Text.Language.detect/2.
Specs
Function to remove text elements that interfere with language detection.
Each corpus has a callback normalize_text/1 that is applied when training the classifier and when detecting language from natural text. If desired, the corpus can delegate to this function.
Argument
- text is any String.t
Returns
- A normalized String.t
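To give a feel for what a normalize_text/1 callback might do, the sketch below strips digits, punctuation and symbols and collapses whitespace. This is not the library's implementation: which elements actually interfere with detection is an assumption made here, and NormalizeSketch is a hypothetical module name.

```elixir
defmodule NormalizeSketch do
  # Hypothetical normalizer, not the Text library's implementation.
  # Assumption: digits, punctuation and symbols carry little language
  # signal, so they are replaced with spaces before detection.
  def normalize_text(text) when is_binary(text) do
    text
    # Replace runs of Unicode numbers, punctuation and symbols with a space.
    |> String.replace(~r/[\p{N}\p{P}\p{S}]+/u, " ")
    # Collapse any run of whitespace into a single space.
    |> String.replace(~r/\s+/u, " ")
    |> String.trim()
  end
end

NormalizeSketch.normalize_text("Hello, world! 123")
```

A corpus that prefers the library's own behaviour can delegate its callback to the function documented above instead of writing its own.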