Text

Text & language processing for Elixir. Initial release focuses on:

  • [x] n-gram generation from text
  • [x] pluralization of English words
  • [x] word counting (word frequencies)
  • [x] language detection using pluggable classifier, vocabulary and corpus backends.

Second phase will focus on:

  • Stemming
  • Tokenization and part-of-speech tagging (at least for English)
  • Sentiment analysis

Each of these phases requires prior development. See below.

Status Update Sept 2021

The Text project remains active and maintained. However, with the advent of the amazing Numerical Elixir (Nx) project, many new opportunities to leverage ML for text analysis open up, and this is the planned path. I expect to focus on using ML for the additional planned functionality as a calendar year 2022 project. Bug reports, PRs and suggestions are welcome!

Installation

def deps do
  [
    {:text, "~> 0.2.0"}
  ]
end

Word Counting

text contains an implementation of word counting that is oriented towards large streams of words rather than discrete strings. Input to Text.Word.word_count/2 can be a String.t, File.Stream.t or Flow.t allowing flexible streaming of text.
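
As a rough illustration of stream-oriented word counting, the sketch below counts words across any enumerable of lines using only the standard library. WordCountSketch and its count/1 function are hypothetical names invented for this example; this is not how Text.Word.word_count/2 is implemented, and the splitting rule (anything that is not a letter, digit or apostrophe separates words) is an assumption.

```elixir
defmodule WordCountSketch do
  # Hypothetical helper for illustration only; not part of the text library.
  # Accepts any enumerable of strings: a list of lines, a File.Stream, etc.
  def count(lines) do
    lines
    # Split lazily on runs of characters that are not letters, digits or apostrophes
    |> Stream.flat_map(&String.split(&1, ~r/[^\p{L}\p{N}']+/u, trim: true))
    # Fold case so "The" and "the" are counted together
    |> Stream.map(&String.downcase/1)
    # Build a map of word => number of occurrences
    |> Enum.frequencies()
  end
end
```

Because the pipeline only needs an enumerable, the same function should work on an in-memory list such as `WordCountSketch.count(["the cat", "The dog"])` or on a lazily read file such as `File.stream!("corpus.txt") |> WordCountSketch.count()` (the file name is a placeholder).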

English Pluralization

text includes an inflector for the English language that takes an approach based upon An Algorithmic Approach to English Pluralization. See the module Text.Inflect.En and the functions:

Language Detection

text contains 3 language classifiers to aid in natural language detection. However, it does not include any corpora; these are contained in separate libraries. The available classifiers are:

Additional classifiers can be added by defining a module that implements the Text.Language.Classifier behaviour.

The library text_corpus_udhr implements the Text.Corpus behaviour for the Universal Declaration of Human Rights, which is available for download in 423 languages from Unicode.

See Text.Language.detect/2.
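
To give a feel for how a corpus-backed detector can work, here is a deliberately naive sketch that scores each language by the number of character bigrams it shares with the input. DetectSketch, profile/1 and detect/2 are hypothetical names; this is not one of the library's classifiers, it does not implement the Text.Language.Classifier or Text.Corpus behaviours, and the tiny sample sentences merely stand in for a real corpus.

```elixir
defmodule DetectSketch do
  # Hypothetical toy detector for illustration; not part of the text library.

  # Build the set of character bigrams found in a string.
  def profile(text) do
    text
    |> String.downcase()
    |> String.graphemes()
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.map(&Enum.join/1)
    |> MapSet.new()
  end

  # `corpora` maps a language tag to sample text in that language.
  # Returns the tag whose sample shares the most bigrams with `text`.
  def detect(text, corpora) do
    input = profile(text)

    {language, _sample} =
      Enum.max_by(corpora, fn {_language, sample} ->
        sample |> profile() |> MapSet.intersection(input) |> MapSet.size()
      end)

    language
  end
end

# Placeholder "corpora"; a real corpus would be far larger.
corpora = %{
  en: "the quick brown fox jumps over the lazy dog",
  de: "der schnelle braune fuchs springt über den faulen hund"
}

DetectSketch.detect("the dog jumps", corpora)
```

A real classifier would weight n-grams by frequency or rank rather than just counting overlaps, which is one reason a pluggable design is useful.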

N-Gram generation

The Text.Ngram module supports efficient generation of n-grams of length 2 to 7. See Text.Ngram.ngram/2.
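
For readers unfamiliar with n-grams, the following sketch slides a window of n graphemes over a string and counts each window. NgramSketch is a hypothetical module written for this example; it is a simple, unoptimised illustration rather than the Text.Ngram implementation, and the frequency-map return value is an assumption that may differ from what Text.Ngram.ngram/2 returns.

```elixir
defmodule NgramSketch do
  # Hypothetical helper for illustration; not the Text.Ngram implementation.
  # Mirrors the documented range of n-gram lengths, 2 to 7.
  def ngrams(text, n) when is_binary(text) and n in 2..7 do
    text
    |> String.graphemes()
    # Sliding window of n graphemes, advancing one grapheme at a time;
    # :discard drops the incomplete windows at the end
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join/1)
    # Map of n-gram => number of occurrences
    |> Enum.frequencies()
  end
end

NgramSketch.ngrams("banana", 2)
# => %{"an" => 2, "ba" => 1, "na" => 2}
```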

Down the rabbit hole

Text analysis at a fundamental level requires segmenting arbitrary text in any language into characters (graphemes), words and sentences. This is a complex topic covered by the Unicode text segmentation standard, augmented by localised rules in CLDR's segmentations data.

Therefore, in order to provide higher-order text analysis, the order of development looks like this:

  1. Finish the Unicode regular expression engine in ex_unicode_set. Most of the work is complete but compound character classes need further work. Unicode regular expressions are required to implement both Unicode transforms and Unicode segmentation.

  2. Implement basic Unicode word and sentence segmentation in ex_unicode_string. Grapheme cluster segmentation is available in the standard library as String.graphemes/1.

  3. Add CLDR tailorings for locale-specific segmentation of words and sentences.

  4. Finish up the Snowball stemming compiler. There is a lot to do here; so far only the parser is partially complete.

  5. Implement stemming
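
The gap that steps 2 and 3 are meant to fill can be seen with the standard library alone. The snippet below is a small illustration using only functions from Elixir's String module: grapheme segmentation already handles combining characters, while the built-in whitespace split leaves punctuation attached to words.

```elixir
# "é" written as "e" followed by a combining acute accent (U+0301)
word = "cafe\u0301"

String.length(word)               # => 4 grapheme clusters
length(String.codepoints(word))   # => 5 code points
byte_size(word)                   # => 6 bytes in UTF-8

# The last grapheme keeps the base letter and its accent together
List.last(String.graphemes(word)) # => "e\u0301"

# Splitting on whitespace is not word segmentation:
# punctuation stays attached to the neighbouring word
String.split("Hello, world!")     # => ["Hello,", "world!"]
```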