Text
Text & language processing for Elixir. Initial release focuses on:
- [x] n-gram generation from text
- [x] pluralization of English words
- [x] word counting (word frequencies)
- [x] language detection using pluggable classifier, vocabulary and corpus backends.
Second phase will focus on:
- Stemming
- Tokenization and part-of-speech tagging (at least for English)
- Sentiment analysis
Each of these phases requires prior development. See below.
Status Update Sept 2021
The `Text` project remains active and maintained. However, with the advent of the amazing Numerical Elixir (Nx) project, many improved opportunities to leverage ML for text analysis open up, and this is the planned path. I expect to focus on using ML for the additional planned functionality as a calendar year 2022 project. Bug reports, PRs and suggestions are welcome!
Installation
def deps do
  [
    {:text, "~> 0.2.0"}
  ]
end
Word Counting
`text` contains an implementation of word counting that is oriented towards large streams of words rather than discrete strings. Input to `Text.Word.word_count/2` can be a `String.t`, `File.Stream.t` or `Flow.t`, allowing flexible streaming of text.
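To illustrate the streaming idea, here is a minimal sketch that uses only the Elixir standard library. It is not the implementation behind `Text.Word.word_count/2`; the module name `WordCountSketch` and the word-splitting rule (runs of letters, digits and apostrophes) are assumptions made for the example.

```elixir
defmodule WordCountSketch do
  @doc "Counts word frequencies across any enumerable of strings, such as a list or a lazy stream of lines."
  def count(lines) do
    lines
    # Split each line on anything that is not a letter, digit or apostrophe (assumed rule)
    |> Stream.flat_map(fn line -> String.split(line, ~r/[^\p{L}\p{N}']+/u, trim: true) end)
    # Fold case so that "The" and "the" are counted together
    |> Stream.map(&String.downcase/1)
    # Consume the stream once, building a map of word => count
    |> Enum.frequencies()
  end
end

WordCountSketch.count(["The cat sat", "the CAT ran"])
# => %{"cat" => 2, "ran" => 1, "sat" => 1, "the" => 2}
```

Because the function accepts any enumerable, the same call should work on a lazy file stream such as `File.stream!("corpus.txt")` (a hypothetical path) without loading the whole file into memory.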
English Pluralization
`text` includes an inflector for the English language that takes an approach based upon An Algorithmic Approach to English Pluralization. See the module `Text.Inflect.En` and the functions:
- `Text.Inflect.En.pluralize/2`
- `Text.Inflect.En.pluralize_noun/2`
- `Text.Inflect.En.pluralize_verb/1`
- `Text.Inflect.En.pluralize_adjective/1`
Language Detection
`text` contains 3 language classifiers to aid in natural language detection. However, it does not include any corpora; these are contained in separate libraries. The available classifiers are:
- `Text.Language.Classifier.CommulativeFrequency`
- `Text.Language.Classifier.NaiveBayesian`
- `Text.Language.Classifier.RankOrder`
Additional classifiers can be added by defining a module that implements the `Text.Language.Classifier` behaviour.
The library text_corpus_udhr implements the `Text.Corpus` behaviour for the Universal Declaration of Human Rights, which is available for download in 423 languages from Unicode.
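As a rough illustration of how n-gram based language detection can work, the sketch below implements a generic rank-order ("out-of-place") comparison using only the standard library. It assumes, without confirmation from this README, that a rank-order classifier compares the frequency ranking of n-grams in the input against a ranking built from each language's corpus. The module `RankOrderSketch`, its function names and the tiny inline samples are all hypothetical; this is not the library's classifier, and it does not use the `Text.Language.Classifier` or `Text.Corpus` behaviours.

```elixir
defmodule RankOrderSketch do
  # Build a map of n-gram => rank (0 = most frequent) for a piece of text.
  # Ties are broken alphabetically so the ranking is deterministic.
  def profile(text, n \\ 2, top \\ 300) do
    text
    |> String.downcase()
    |> String.graphemes()
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join/1)
    |> Enum.frequencies()
    |> Enum.sort_by(fn {gram, count} -> {-count, gram} end)
    |> Enum.take(top)
    |> Enum.with_index()
    |> Map.new(fn {{gram, _count}, rank} -> {gram, rank} end)
  end

  # Sum of rank differences; an n-gram missing from the language
  # profile is charged the maximum penalty.
  def distance(doc_profile, lang_profile) do
    penalty = map_size(lang_profile)

    Enum.reduce(doc_profile, 0, fn {gram, rank}, acc ->
      case Map.fetch(lang_profile, gram) do
        {:ok, lang_rank} -> acc + abs(rank - lang_rank)
        :error -> acc + penalty
      end
    end)
  end

  # Return the language whose profile is closest to the text's profile.
  def detect(text, samples) do
    doc = profile(text)

    samples
    |> Enum.map(fn {lang, sample} -> {lang, distance(doc, profile(sample))} end)
    |> Enum.min_by(fn {_lang, dist} -> dist end)
    |> elem(0)
  end
end

# Toy samples for illustration only; real detection needs a proper corpus.
samples = %{en: "the quick brown fox", de: "der schnelle braune fuchs"}
RankOrderSketch.detect("the quick brown fox", samples)
# => :en
```

With samples this small the result is only meaningful for near-identical input; a corpus such as the one provided by text_corpus_udhr is what makes the approach usable in practice.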
N-Gram generation
The `Text.Ngram` module supports efficient generation of n-grams of length `2` to `7`. See `Text.Ngram.ngram/2`.
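For readers unfamiliar with the term, the following standalone sketch shows what generating character n-grams means. It uses only the standard library; the module name `NgramSketch` and the choice to return a map of n-gram counts are assumptions for illustration, and the return shape of `Text.Ngram.ngram/2` may differ.

```elixir
defmodule NgramSketch do
  # Slide a window of n graphemes over the text and count each window.
  # The guard mirrors the 2..7 range described above.
  def ngrams(text, n) when n in 2..7 do
    text
    |> String.graphemes()
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join/1)
    |> Enum.frequencies()
  end
end

NgramSketch.ngrams("banana", 2)
# => %{"an" => 2, "ba" => 1, "na" => 2}
```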
Down the rabbit hole
Text analysis at a fundamental level requires segmenting arbitrary text in any language into characters (graphemes), words and sentences. This is a complex topic covered by the Unicode text segmentation standard, augmented by localised rules in CLDR's segmentations data.
Therefore, in order to provide higher order text analysis, the order of development looks like this:
1. Finish the Unicode regular expression engine in ex_unicode_set. Most of the work is complete but compound character classes need further work. Unicode regular expressions are required to implement both Unicode transforms and Unicode segmentation.
2. Implement basic Unicode word and sentence segmentation in ex_unicode_string. Grapheme cluster segmentation is available in the standard library as `String.graphemes/1`.
3. Add CLDR tailorings for locale-specific segmentation of words and sentences.
4. Finish up the Snowball stemming compiler. There is a lot to do here; only the parser is partially complete.
5. Implement stemming.
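The grapheme-level segmentation that the standard library already provides can be seen in a short example. The word below contains a combining diaeresis (U+0308), so it has five code points but only four user-perceived characters:

```elixir
# "noe\u0308l" is "noël" written with a combining diaeresis after the "e"
word = "noe\u0308l"

String.codepoints(word)
# => ["n", "o", "e", "\u0308", "l"]   (5 code points)

String.graphemes(word)
# => ["n", "o", "e\u0308", "l"]       (4 grapheme clusters)

String.length(word)
# => 4
```

Word and sentence boundaries need comparable rules, which is the gap the steps above aim to fill.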