# Text v0.5.0 - Table of Contents

Text analysis and processing for Elixir including ngram, language detection and more.

## Pages

- [Text](readme.md)
- [Changelog](changelog.md)
- [LICENSE](license.md)
- Guides
  - [Text classification — language identification](text_classification.md)
  - [Sentiment analysis](sentiment.md)
  - [Part-of-speech tagging and named-entity recognition](pos_ner.md)
  - [Keyword-in-context (KWIC) concordance](kwic.md)
  - [Word clouds](word_clouds.md)

## Modules

- [Text](Text.md): Functions for basic text processing and analysis.
- [Text.Clean](Text.Clean.md): Text cleanup utilities: HTML stripping, whitespace collapse, Unicode normalization, and mojibake repair.
- [Text.Collocation](Text.Collocation.md): Extract statistically significant word bigrams from a token stream.
- [Text.Data](Text.Data.md): Locates runtime data files used by Text modules, fetching them from upstream sources when permitted.
- [Text.Distance](Text.Distance.md): String edit-distance algorithms.
- [Text.Embedding](Text.Embedding.md): Word embeddings — load pre-trained vectors and compute similarity, nearest neighbours, and analogies.
- [Text.Emoji](Text.Emoji.md): Emoji detection and short-name conversion.
- [Text.Hyphenation](Text.Hyphenation.md): Hyphenation via Liang's algorithm with TeX hyphenation patterns.
- [Text.IR](Text.IR.md): Information-retrieval scoring against an indexed corpus.
- [Text.IR.Corpus](Text.IR.Corpus.md): An indexed corpus of documents for information-retrieval scoring.
- [Text.Inflect.En](Text.Inflect.En.md): Pluralisation for the English language based on the paper [An Algorithmic Approach to English Pluralization](http://users.monash.edu/~damian/papers/HTML/Plurals.html).
- [Text.KWIC](Text.KWIC.md): Keyword-In-Context concordance.
- [Text.KWIC.Match](Text.KWIC.Match.md): A single keyword-in-context occurrence.
- [Text.Language](Text.Language.md): Language tag utilities used across the package.
- [Text.Language.Classifier.Fasttext](Text.Language.Classifier.Fasttext.md): Pure-Elixir port of fastText's `lid.176` language identification model.
- [Text.Language.Classifier.Fasttext.Args](Text.Language.Classifier.Fasttext.Args.md): Training and model hyperparameters extracted from a fastText model file.
- [Text.Language.Classifier.Fasttext.Detection](Text.Language.Classifier.Fasttext.Detection.md): The result of running fastText language identification on a piece of text.
- [Text.Language.Classifier.Fasttext.Dictionary](Text.Language.Classifier.Fasttext.Dictionary.md): Vocabulary and label table parsed from a fastText model file.
- [Text.Language.Classifier.Fasttext.Entry](Text.Language.Classifier.Fasttext.Entry.md): A single dictionary entry parsed from a fastText model file.
- [Text.Language.Classifier.Fasttext.Features](Text.Language.Classifier.Fasttext.Features.md): Converts an input string into the flat list of input-matrix row indices that fastText averages to produce a feature vector.
- [Text.Language.Classifier.Fasttext.Hash](Text.Language.Classifier.Fasttext.Hash.md): Bit-exact port of fastText's string hash function.
- [Text.Language.Classifier.Fasttext.HuffmanTree](Text.Language.Classifier.Fasttext.HuffmanTree.md): The Huffman tree fastText constructs over output labels for hierarchical softmax inference.
- [Text.Language.Classifier.Fasttext.Inference](Text.Language.Classifier.Fasttext.Inference.md): Forward-pass scoring for fastText models.
- [Text.Language.Classifier.Fasttext.Locale](Text.Language.Classifier.Fasttext.Locale.md): Resolves a language detection into a CLDR-canonical locale string.
- [Text.Language.Classifier.Fasttext.Model](Text.Language.Classifier.Fasttext.Model.md): A fully-loaded fastText model.
- [Text.Language.Classifier.Fasttext.ModelLoader](Text.Language.Classifier.Fasttext.ModelLoader.md): Parses a fastText `.bin` model file into a `Text.Language.Classifier.Fasttext.Model` struct.
- [Text.Language.Classifier.Fasttext.ScriptDetector](Text.Language.Classifier.Fasttext.ScriptDetector.md): Identifies the dominant Unicode script of a piece of text.
- [Text.Language.Classifier.Fasttext.Subwords](Text.Language.Classifier.Fasttext.Subwords.md): Character n-gram extraction and input-matrix indexing for fastText models.
- [Text.Language.Classifier.Fasttext.Tokenizer](Text.Language.Classifier.Fasttext.Tokenizer.md): Whitespace tokenizer matching fastText's `Dictionary::readWord`.
- [Text.Lemma](Text.Lemma.md): Dictionary-driven lemmatization.
- [Text.NER](Text.NER.md): Named-entity recognition via [Bumblebee](https://hex.pm/packages/bumblebee).
- [Text.NER.Entity](Text.NER.Entity.md): A single named entity span.
- [Text.Ngram](Text.Ngram.md): Compute ngrams and their counts from a given UTF8 string.
- [Text.PII](Text.PII.md): Pattern-based detection and redaction of personally-identifiable information.
- [Text.POS](Text.POS.md): Part-of-speech tagging via [Bumblebee](https://hex.pm/packages/bumblebee).
- [Text.Phonetic.Cologne](Text.Phonetic.Cologne.md): Cologne phonetics (Kölner Phonetik), the German-language counterpart to Soundex.
- [Text.Phonetic.DoubleMetaphone](Text.Phonetic.DoubleMetaphone.md): Double Metaphone phonetic encoding (Lawrence Philips, 2000).
- [Text.Phonetic.Metaphone](Text.Phonetic.Metaphone.md): Metaphone phonetic encoding (Lawrence Philips, 1990).
- [Text.Phonetic.NYSIIS](Text.Phonetic.NYSIIS.md): New York State Identification and Intelligence System (NYSIIS) phonetic encoding (Robert L. Taft, 1970).
- [Text.Phonetic.Soundex](Text.Phonetic.Soundex.md): Soundex phonetic encoding (Russell-Odell, 1918).
- [Text.Readability](Text.Readability.md): Readability metrics for English text.
- [Text.Segment](Text.Segment.md): Locale-aware word and sentence segmentation.
- [Text.Sentiment](Text.Sentiment.md): Sentiment analysis with multilingual support.
- [Text.Sentiment.Backend](Text.Sentiment.Backend.md): Behaviour for sentiment-analysis backends.
- [Text.Sentiment.Backends.Bumblebee](Text.Sentiment.Backends.Bumblebee.md): Neural sentiment backend backed by [Bumblebee](https://hex.pm/packages/bumblebee).
- [Text.Sentiment.Backends.Lexicon](Text.Sentiment.Backends.Lexicon.md): Default sentiment backend — lexicon-based, multilingual via the bundled AFINN lexicons.
- [Text.Sentiment.Lexicon](Text.Sentiment.Lexicon.md): Lexicon-based sentiment scoring.
- [Text.Sentiment.Lexicons.AFINN](Text.Sentiment.Lexicons.AFINN.md): Bundled [AFINN](https://github.com/fnielsen/afinn) sentiment lexicons.
- [Text.Similarity](Text.Similarity.md): Set- and vector-based string similarity coefficients.
- [Text.Slug](Text.Slug.md): URL-safe slug generation with locale-aware Unicode folding.
- [Text.Spell](Text.Spell.md): Spell correction.
- [Text.Stopwords](Text.Stopwords.md): Bundled multilingual stopword lists.
- [Text.Summarize](Text.Summarize.md): Extractive text summarization.
- [Text.Syllable](Text.Syllable.md): Syllable counting for English words.
- [Text.Truecase](Text.Truecase.md): Restore case to text that has been lowercased.
- [Text.Word](Text.Word.md): Implements word counting for lists, streams and flows.
- [Text.WordCloud](Text.WordCloud.md): Builds a weighted list of terms suitable for rendering as a word cloud.
- [Text.WordCloud.Backend](Text.WordCloud.Backend.md): Behaviour for `Text.WordCloud` scoring backends.
- [Text.WordCloud.Backends.Frequency](Text.WordCloud.Backends.Frequency.md): Trivial frequency-counting backend for `Text.WordCloud`.
- [Text.WordCloud.Backends.KeyBERT](Text.WordCloud.Backends.KeyBERT.md): Neural keyword-extraction backend backed by [Bumblebee](https://hex.pm/packages/bumblebee).
- [Text.WordCloud.Backends.RAKE](Text.WordCloud.Backends.RAKE.md): RAKE (Rapid Automatic Keyword Extraction) backend for `Text.WordCloud`.
- [Text.WordCloud.Backends.TFIDF](Text.WordCloud.Backends.TFIDF.md): TF-IDF backend for `Text.WordCloud`.
- [Text.WordCloud.Backends.TextRank](Text.WordCloud.Backends.TextRank.md): TextRank backend for `Text.WordCloud`.
- [Text.WordCloud.Backends.YAKE](Text.WordCloud.Backends.YAKE.md): YAKE! (Yet Another Keyword Extractor) backend for `Text.WordCloud`.
- [Text.WordCloud.Layout](Text.WordCloud.Layout.md): Wordle-style spiral layout for word-cloud rendering.
- [Text.WordCloud.SVG](Text.WordCloud.SVG.md): Renders a laid-out word cloud as an SVG string.
- [Text.WordFreq](Text.WordFreq.md): Word frequency lookup tables.

## Mix Tasks

- [mix text.download_lemma_data](Mix.Tasks.Text.DownloadLemmaData.md): Downloads lemmatization dictionaries from the [`michmech/lemmatization-lists`](https://github.com/michmech/lemmatization-lists) upstream and places them in the configured `Text.Data` cache so `Text.Lemma` can load them with no further network access.
- [mix text.download_lid176](Mix.Tasks.Text.DownloadLid176.md): Downloads the fastText `lid.176.bin` model file used by `Text.Language.Classifier.Fasttext` for language identification.
- [mix text.download_models](Mix.Tasks.Text.DownloadModels.md): Pre-downloads every external model used by `:text` so that subsequent calls run without network access.
- [mix text.gen_afinn_lexicons](Mix.Tasks.Text.GenAfinnLexicons.md): Converts the [AFINN](https://github.com/fnielsen/afinn) data vendored under `data/affin/` into per-language TSV files under `priv/sentiment/`, ready for compile-time loading by `Text.Sentiment.Lexicons.AFINN`.
- [mix text.gen_golden_fixtures](Mix.Tasks.Text.GenGoldenFixtures.md): Runs the canonical test inputs through the reference fastText implementation and writes per-input top-K predictions to `test/fixtures/golden_predictions.json`.
- [mix text.gen_stopwords](Mix.Tasks.Text.GenStopwords.md): Fetches the [stopwords-iso](https://github.com/stopwords-iso/stopwords-iso) bundle (a single JSON file mapping ISO 639-1 codes to lists of stopwords) and writes one plain-text file per language under `priv/stopwords/`.