ExNLP
A comprehensive Natural Language Processing library for Elixir, providing tokenization, stemming, ranking algorithms, similarity metrics, and text analysis tools. Inspired by Python's NLTK and designed with idiomatic Elixir patterns.
Features
- 🔤 Tokenization: Multiple tokenizers (standard, whitespace, regex, n-gram, keyword) with NLTK-inspired API
- ✂️ Stemming: Snowball stemmers for 7 languages (English, Spanish, Portuguese, French, German, Italian, Polish)
- 📊 Ranking Algorithms: TF-IDF and BM25 implementations for document ranking and search
- 🔍 Similarity Metrics: Levenshtein, Jaccard, Dice, Jaro-Winkler, Hamming distance, and more
- 🚫 Stopwords: Built-in stopword lists for 30+ languages
- 🔧 Text Filtering: Case conversion, length filtering, pattern replacement, and stopword removal
- 📈 Statistics: Term frequency, document frequency, corpus-level statistics
- 🔗 Co-occurrence Analysis: Term co-occurrence matrices and analysis
- 📝 N-grams: Character and word n-gram generation
- 🎯 Idiomatic Elixir: Clean, functional code following Elixir best practices
Installation
Add ex_nlp to your list of dependencies in mix.exs:
def deps do
[
{:ex_nlp, "~> 0.1.0"}
]
end
Then run mix deps.get.
Quick Start
Tokenization
# Simple word tokenization
iex> ExNlp.Tokenizer.word_tokenize("Hello, world!")
["Hello", "world"]
# Get tokens with position and offset information
iex> ExNlp.Tokenizer.tokenize("Hello, world!")
[
%ExNlp.Token{text: "Hello", position: 0, start_offset: 0, end_offset: 5},
%ExNlp.Token{text: "world", position: 1, start_offset: 7, end_offset: 12}
]
# Custom regex tokenizer
iex> ExNlp.Tokenizer.regexp_tokenize("abc123 def456", "\\d+")
["123", "456"]Stemming
# Stem words in multiple languages
iex> ExNlp.Snowball.stem("running", :english)
"run"
iex> ExNlp.Snowball.stem("caminando", :spanish)
"camin"
iex> ExNlp.Snowball.stem_words(["running", "jumping", "beautiful"], :english)
["run", "jump", "beauti"]
# Check supported languages
iex> ExNlp.Snowball.supported_languages()
[:english, :spanish, :portuguese, :french, :german, :italian, :polish]
Ranking Algorithms
TF-IDF
# Calculate TF-IDF score for a term in a document
iex> documents = ["The quick brown fox", "A brown dog"]
iex> ExNlp.Ranking.TfIdf.calculate("fox", "The quick brown fox", documents)
0.5108256237659907
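For intuition, the core of TF-IDF can be sketched in a few lines of plain Elixir. This is an illustrative sketch of the textbook formula (raw term frequency times inverse document frequency), not ExNlp's actual implementation, whose smoothing may differ:

```elixir
# Illustrative TF-IDF sketch; assumes every queried term occurs in the corpus.
defmodule TfIdfSketch do
  # Fraction of tokens in the document equal to `term`.
  def tf(term, doc_tokens) do
    Enum.count(doc_tokens, &(&1 == term)) / length(doc_tokens)
  end

  # Natural log of (corpus size / number of documents containing `term`).
  def idf(term, corpus) do
    df = Enum.count(corpus, fn doc -> term in doc end)
    :math.log(length(corpus) / df)
  end

  def tf_idf(term, doc_tokens, corpus) do
    tf(term, doc_tokens) * idf(term, corpus)
  end
end

corpus = [~w(the quick brown fox), ~w(a brown dog)]
TfIdfSketch.tf_idf("fox", hd(corpus), corpus)
```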
# With preprocessing options
iex> ExNlp.Ranking.TfIdf.calculate("running", "The runner is running fast", documents,
...> stem: true, language: :english, remove_stopwords: true)
0.6931471805599453
BM25
# Score documents against a query
iex> documents = ["BM25 is a ranking function", "used by search engines"]
iex> ExNlp.Ranking.Bm25.score(documents, ["ranking", "search"])
[1.8455076734299591, 1.0126973514850315]
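BM25 combines the same term-frequency and document-frequency signals with document-length normalization, controlled by the k1 and b parameters. The following is an illustrative sketch of one common BM25 variant (with the +1 idf smoothing used by Lucene), not ExNlp's internals, so its numbers won't match the library's exactly:

```elixir
# Illustrative BM25 sketch: score one tokenized document against query terms.
defmodule Bm25Sketch do
  def score(doc_tokens, query_terms, corpus, k1 \\ 1.2, b \\ 0.75) do
    n = length(corpus)
    avgdl = Enum.sum(Enum.map(corpus, &length/1)) / n
    dl = length(doc_tokens)

    Enum.reduce(query_terms, 0.0, fn term, acc ->
      df = Enum.count(corpus, fn d -> term in d end)
      idf = :math.log((n - df + 0.5) / (df + 0.5) + 1)
      tf = Enum.count(doc_tokens, &(&1 == term))
      # Length normalization: long documents are penalized in proportion to b.
      acc + idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    end)
  end
end

corpus = [~w(bm25 is a ranking function), ~w(used by search engines)]
Bm25Sketch.score(hd(corpus), ["ranking"], corpus)
```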
# Rank documents with custom parameters
iex> ExNlp.Ranking.Bm25.score(documents, ["ranking", "search"],
...> k1: 1.5, b: 0.75, stem: true, language: :english)
[1.923456, 1.123456]
Similarity Metrics
# Levenshtein distance
iex> ExNlp.Similarity.levenshtein("kitten", "sitting")
3
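Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. A naive recursive sketch (illustrative only; a real implementation would memoize or use dynamic programming, as ExNlp's presumably does):

```elixir
# Illustrative recursive Levenshtein distance; exponential time, fine for
# short strings only.
defmodule LevSketch do
  def dist(a, b), do: d(String.graphemes(a), String.graphemes(b))

  # If one string is empty, the distance is the length of the other.
  defp d(a, []), do: length(a)
  defp d([], b), do: length(b)

  defp d([ha | ta] = a, [hb | tb] = b) do
    if ha == hb do
      d(ta, tb)
    else
      # 1 edit plus the cheapest of: delete from a, delete from b, substitute.
      1 + Enum.min([d(ta, b), d(a, tb), d(ta, tb)])
    end
  end
end

LevSketch.dist("kitten", "sitting")
# => 3
```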
# Levenshtein similarity (normalized)
iex> ExNlp.Similarity.levenshtein_similarity("kitten", "sitting")
0.5714285714285714
# Jaccard similarity
iex> ExNlp.Similarity.jaccard(["cat", "dog"], ["cat", "bird"])
0.3333333333333333
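Jaccard similarity is simply the size of the intersection over the size of the union of the two token sets, which is easy to verify with MapSet (an illustrative sketch, not ExNlp's code):

```elixir
# Jaccard similarity on token sets: |A ∩ B| / |A ∪ B|.
jaccard = fn a, b ->
  a = MapSet.new(a)
  b = MapSet.new(b)
  MapSet.size(MapSet.intersection(a, b)) / MapSet.size(MapSet.union(a, b))
end

jaccard.(["cat", "dog"], ["cat", "bird"])
# => 0.3333333333333333  (intersection {cat}, union {cat, dog, bird})
```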
# Jaro-Winkler similarity
iex> ExNlp.Similarity.jaro_winkler_similarity("martha", "marhta")
0.9611111111111111
# Dice coefficient
iex> ExNlp.Similarity.dice_coefficient(["cat", "dog"], ["cat", "bird"])
0.5
Stopwords
# Check if a word is a stopword
iex> ExNlp.Stopwords.is_stopword?("the", :english)
true
# Remove stopwords from a list
iex> words = ["the", "quick", "brown", "fox"]
iex> ExNlp.Stopwords.remove(words, :english)
["quick", "brown", "fox"]
# Get list of stopwords
iex> ExNlp.Stopwords.list(:english)
["a", "all", "and", "as", "at", ...]Text Filtering
# Build a filtering pipeline
iex> tokens = ExNlp.Tokenizer.tokenize("The Quick Brown Fox")
iex> tokens
...> |> ExNlp.Filter.lowercase()
...> |> ExNlp.Filter.stop_words(:english)
...> |> ExNlp.Filter.min_length(3)
[
%ExNlp.Token{text: "quick", ...},
%ExNlp.Token{text: "brown", ...},
%ExNlp.Token{text: "fox", ...}
]
N-grams
# Character n-grams
iex> ExNlp.Ngram.char_ngrams("hello", 2)
["he", "el", "ll", "lo"]
# Word n-grams
iex> ExNlp.Ngram.word_ngrams(["the", "quick", "brown", "fox"], 2)
[["the", "quick"], ["quick", "brown"], ["brown", "fox"]]Statistics
# Term frequency in a document
iex> ExNlp.Statistics.term_frequency("cat", ["the", "cat", "sat", "on", "the", "mat"])
1
# Document frequency in a corpus
iex> corpus = [["cat", "dog"], ["cat", "bird"], ["dog", "fish"]]
iex> ExNlp.Statistics.document_frequency("cat", corpus)
2
# Most frequent terms
iex> corpus = [["cat", "dog"], ["cat", "cat"], ["dog"]]
iex> ExNlp.Statistics.most_frequent(corpus, 2)
[{"cat", 3}, {"dog", 2}]Co-occurrence Analysis
# Build co-occurrence matrix
iex> corpus = [["cat", "dog"], ["cat", "bird", "dog"], ["bird"]]
iex> matrix = ExNlp.Cooccurrence.cooccurrence_matrix(corpus)
iex> matrix["cat"]["dog"]
2
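At document granularity, a co-occurrence count is just the number of documents in which both terms appear, as this illustrative one-liner shows (a sketch, not ExNlp's matrix builder):

```elixir
# Count documents containing both terms.
cooccurrences = fn corpus, a, b ->
  Enum.count(corpus, fn doc -> a in doc and b in doc end)
end

corpus = [["cat", "dog"], ["cat", "bird", "dog"], ["bird"]]
cooccurrences.(corpus, "cat", "dog")
# => 2
```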
# Find co-occurring terms
iex> ExNlp.Cooccurrence.cooccurring_terms("cat", corpus, 2)
[{"dog", 2}, {"bird", 1}]Supported Languages
Stemming
- English - Porter2 algorithm (Porter stemmer v2)
- Spanish - Spanish stemmer
- Portuguese - Portuguese stemmer
- French - French stemmer
- German - German stemmer
- Italian - Italian stemmer
- Polish - Polish stemmer
Stopwords
Stopword lists are available for 30+ languages including: English, Spanish, Portuguese, French, German, Italian, Polish, Russian, Dutch, Swedish, Norwegian, Danish, Finnish, Turkish, Arabic, Chinese, and more. See priv/stopwords/ for the complete list.
Architecture
The library is organized into logical modules:
- ExNlp.Tokenizer - Text tokenization with multiple strategies
- ExNlp.Snowball - Word stemming algorithms
- ExNlp.Ranking - Document ranking (TF-IDF, BM25)
- ExNlp.Similarity - String and set similarity metrics
- ExNlp.Stopwords - Stopword detection and filtering
- ExNlp.Filter - Token filtering and transformation
- ExNlp.Statistics - Text and corpus statistics
- ExNlp.Cooccurrence - Term co-occurrence analysis
- ExNlp.Ngram - N-gram generation
Performance
The library includes benchmark suites for critical operations. Run benchmarks with:
mix run benchmarks/tokenizer_bench.exs
mix run benchmarks/similarity_bench.exs
mix run benchmarks/ranking_bench.exs
Testing
Run the test suite with:
mix test
Documentation
Generate documentation with:
mix docs
Contributing
Contributions are welcome! This library aims to be a comprehensive NLP toolkit for Elixir. Areas for contribution:
- Additional language support for stemming
- More stopword lists
- Additional similarity metrics
- Performance optimizations
- Documentation improvements
Credits
- Stemming algorithms based on the Snowball Stemming Algorithms
- Inspired by Python's NLTK and spaCy
- Stopword lists compiled from various open sources
License
MIT License - see LICENSE file for details.