ExNlp.Ranking (ex_nlp v0.1.0)
Unified API for ranking algorithms.
This module provides access to various ranking algorithms for information retrieval and text mining, including TF-IDF and BM25.
Overview
Both TF-IDF and BM25 are ranking algorithms used to score documents based on their relevance to queries:
TF-IDF (Term Frequency-Inverse Document Frequency): A classic weighting scheme that reflects how important a word is to a document in a corpus. Used widely in text mining and information retrieval.
BM25 (Best Matching 25): An evolution of TF-IDF that addresses some limitations, particularly with document length normalization and term frequency saturation. Considered more effective for search engines.
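The saturation difference can be made concrete with a small sketch. This is not ExNlp's implementation, just the standard textbook formulas for a single term's weight: TF-IDF grows linearly with term frequency, while BM25's `k1` parameter caps the contribution of repeated occurrences.

```elixir
# Sketch of the two term-weighting formulas (standard definitions,
# not ExNlp internals). k1 and b use the common BM25 defaults.
defmodule RankingSketch do
  # Classic TF-IDF weight for one term: tf * log(N / df).
  def tf_idf(tf, n_docs, doc_freq) do
    tf * :math.log(n_docs / doc_freq)
  end

  # BM25 weight for one term; dl is the document length and
  # avgdl the average document length in the corpus.
  def bm25(tf, n_docs, doc_freq, dl, avgdl, k1 \\ 1.2, b \\ 0.75) do
    idf = :math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
    idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
  end
end
```

Evaluating both at tf = 1 and tf = 10 (with N = 100, df = 10, dl = avgdl) shows the effect: the TF-IDF weight is exactly 10x larger at tf = 10, while the BM25 weight grows by only about 2x.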
Both algorithms share similar preprocessing capabilities:
- Tokenization (with custom tokenizers)
- Stemming (multiple languages)
- Stop word removal
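A minimal sketch of such a preprocessing pipeline, using a hypothetical helper module (not ExNlp's API); stemming is omitted here since it requires a language-specific stemmer:

```elixir
# Hypothetical preprocessing sketch: lowercase, tokenize, drop stop words.
defmodule PreprocessSketch do
  @stop_words MapSet.new(["the", "a", "an", "is", "by"])

  def tokens(text) do
    text
    |> String.downcase()
    # Naive tokenizer: split on any run of non-letter, non-digit characters.
    |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
    |> Enum.reject(&MapSet.member?(@stop_words, &1))
  end
end

PreprocessSketch.tokens("The quick brown fox")
# => ["quick", "brown", "fox"]
```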
Examples
# TF-IDF
iex> documents = ["The quick brown fox", "A brown dog"]
iex> ExNlp.Ranking.TfIdf.calculate("fox", "The quick brown fox", documents)
0.5108256237659907
# BM25
iex> documents = ["BM25 is a ranking function", "used by search engines"]
iex> ExNlp.Ranking.Bm25.score(documents, ["ranking", "search"])
[1.8455076734299591, 1.0126973514850315]
Module Structure
- ExNlp.Ranking.TfIdf - TF-IDF implementation
- ExNlp.Ranking.Bm25 - BM25 ranking algorithm
- ExNlp.Ranking.Base - Shared utilities for token processing
Summary
Functions
Convenience function to score documents with BM25 (delegates to Bm25.score/3).
Convenience function to calculate TF-IDF (delegates to TfIdf.calculate/4).
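The delegation described above is Elixir's standard `defdelegate` pattern. A self-contained sketch with stub modules (hypothetical names, not ExNlp's real internals):

```elixir
# Stub standing in for a scoring module like Bm25.
defmodule Sketch.Bm25 do
  # Returns a dummy score per document.
  def score(documents, _query, _opts), do: Enum.map(documents, fn _ -> 0.0 end)
end

# Facade module forwarding calls, as ExNlp.Ranking is described to do.
defmodule Sketch.Ranking do
  defdelegate score(documents, query, opts), to: Sketch.Bm25
end

Sketch.Ranking.score(["doc one", "doc two"], ["query"], [])
# => [0.0, 0.0]
```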