ExNlp.Statistics (ex_nlp v0.1.0)
Corpus-level term statistics.
This module provides functions for calculating statistics about terms in documents and corpora, such as term, document, and collection frequencies, which are useful for search engine analysis.
Examples
# Term frequency in a document
iex> ExNlp.Statistics.term_frequency("cat", ["the", "cat", "sat", "on", "the", "mat"])
1
# Document frequency in a corpus
iex> corpus = [["cat", "dog"], ["cat", "bird"], ["dog", "fish"]]
iex> ExNlp.Statistics.document_frequency("cat", corpus)
2
# Most frequent terms
iex> corpus = [["cat", "dog"], ["cat", "cat"], ["dog"]]
iex> ExNlp.Statistics.most_frequent(corpus, 2)
[{"cat", 3}, {"dog", 2}]
Summary
Functions
average_document_length/1 - Calculates the average document length (number of tokens) in a corpus.
collection_frequency/2 - Counts the total number of occurrences of a term across all documents in a corpus.
document_frequency/2 - Counts the number of documents containing a term in a corpus.
most_frequent/2 - Returns the top N most frequent terms in a corpus.
term_frequency/2 - Counts the number of occurrences of a term in a document.
term_statistics/2 - Returns comprehensive statistics for a term in a corpus.
vocabulary_size/1 - Returns the number of unique terms in a corpus.
Types
Functions
Calculates the average document length (number of tokens) in a corpus.
Examples
iex> corpus = [["cat", "dog"], ["cat"], ["dog", "fish", "bird"]]
iex> ExNlp.Statistics.average_document_length(corpus)
2.0
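For intuition, the average above is the total token count divided by the number of documents. The following is a minimal sketch of that arithmetic, not the library's implementation; the module name `AvgLengthSketch` is hypothetical, and returning `0.0` for an empty corpus is an assumption made here to avoid division by zero.

```elixir
defmodule AvgLengthSketch do
  # Hypothetical helper, not part of ExNlp.
  # Assumption: an empty corpus yields 0.0 rather than raising.
  def average_document_length([]), do: 0.0

  def average_document_length(corpus) do
    # Sum the token counts of all documents, then divide by the document count.
    total_tokens = corpus |> Enum.map(&length/1) |> Enum.sum()
    total_tokens / length(corpus)
  end
end

corpus = [["cat", "dog"], ["cat"], ["dog", "fish", "bird"]]
IO.inspect(AvgLengthSketch.average_document_length(corpus))
# => 2.0  (6 tokens / 3 documents)
```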
@spec collection_frequency(String.t(), corpus()) :: non_neg_integer()
Counts the total number of occurrences of a term across all documents in a corpus.
Examples
iex> corpus = [["cat", "dog"], ["cat", "cat"], ["dog"]]
iex> ExNlp.Statistics.collection_frequency("cat", corpus)
3
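Collection frequency can be read as the sum of per-document term counts. The sketch below illustrates that reading with plain `Enum` calls; `CollectionFrequencySketch` is a hypothetical module name and the code is not taken from the library.

```elixir
defmodule CollectionFrequencySketch do
  # Hypothetical helper, not part of ExNlp.
  def collection_frequency(term, corpus) do
    corpus
    # Count exact matches of the term within each document.
    |> Enum.map(fn document -> Enum.count(document, &(&1 == term)) end)
    # Add the per-document counts together.
    |> Enum.sum()
  end
end

corpus = [["cat", "dog"], ["cat", "cat"], ["dog"]]
IO.inspect(CollectionFrequencySketch.collection_frequency("cat", corpus))
# => 3  (1 + 2 + 0)
```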
@spec document_frequency(String.t(), corpus()) :: non_neg_integer()
Counts the number of documents containing a term in a corpus.
Examples
iex> corpus = [["cat", "dog"], ["cat", "bird"], ["dog", "fish"]]
iex> ExNlp.Statistics.document_frequency("cat", corpus)
2
iex> corpus = [["cat", "dog"], ["cat", "bird"], ["dog", "fish"]]
iex> ExNlp.Statistics.document_frequency("bird", corpus)
1
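Unlike collection frequency, document frequency counts each document at most once, no matter how often the term repeats inside it. A minimal sketch of that distinction, assuming exact string matching (`DocumentFrequencySketch` is a hypothetical name, not the library's code):

```elixir
defmodule DocumentFrequencySketch do
  # Hypothetical helper, not part of ExNlp.
  def document_frequency(term, corpus) do
    # A document contributes 1 if the term appears in it at all.
    Enum.count(corpus, fn document -> Enum.member?(document, term) end)
  end
end

corpus = [["cat", "dog"], ["cat", "cat"], ["dog"]]
IO.inspect(DocumentFrequencySketch.document_frequency("cat", corpus))
# => 2  (the second document counts once despite two occurrences)
```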
@spec most_frequent(corpus(), non_neg_integer()) :: [{String.t(), non_neg_integer()}]
Returns the top N most frequent terms in a corpus.
Returns a list of {term, count} tuples sorted by frequency (descending).
Examples
iex> corpus = [["cat", "dog"], ["cat", "cat"], ["dog"]]
iex> ExNlp.Statistics.most_frequent(corpus, 2)
[{"cat", 3}, {"dog", 2}]
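One way to produce such a ranking is to flatten the corpus, tally the terms, and sort by count. The sketch below does this with `Enum.frequencies/1` (available since Elixir 1.10). It is not the library's implementation: `MostFrequentSketch` is a hypothetical name, and breaking ties alphabetically is an assumption, since the documentation does not state how ties are ordered.

```elixir
defmodule MostFrequentSketch do
  # Hypothetical helper, not part of ExNlp.
  def most_frequent(corpus, n) do
    corpus
    # Merge all documents into one list of tokens.
    |> Enum.concat()
    # Build a %{term => count} map.
    |> Enum.frequencies()
    # Highest count first; ties broken alphabetically (an assumption).
    |> Enum.sort_by(fn {term, count} -> {-count, term} end)
    |> Enum.take(n)
  end
end

corpus = [["cat", "dog"], ["cat", "cat"], ["dog"]]
IO.inspect(MostFrequentSketch.most_frequent(corpus, 2))
# => [{"cat", 3}, {"dog", 2}]
```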
@spec term_frequency(String.t(), document()) :: non_neg_integer()
Counts the number of occurrences of a term in a document.
Examples
iex> ExNlp.Statistics.term_frequency("cat", ["the", "cat", "sat", "on", "the", "mat"])
1
iex> ExNlp.Statistics.term_frequency("the", ["the", "cat", "sat", "on", "the", "mat"])
2
@spec term_statistics(String.t(), corpus()) :: %{ term_frequency: float(), document_frequency: non_neg_integer(), collection_frequency: non_neg_integer(), documents: [non_neg_integer()] }
Returns comprehensive statistics for a term in a corpus.
Returns a map with:
:term_frequency - Average TF across documents containing the term
:document_frequency - Number of documents containing the term
:collection_frequency - Total occurrences in corpus
:documents - List of document indices containing the term
Examples
iex> corpus = [["cat", "dog"], ["cat", "cat"], ["dog"]]
iex> stats = ExNlp.Statistics.term_statistics("cat", corpus)
iex> stats[:document_frequency]
2
iex> stats[:collection_frequency]
3
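To show how the four fields relate, here is a minimal sketch that derives them in one pass. It is an illustration under stated assumptions, not the library's code: `TermStatisticsSketch` is a hypothetical name, document indices are assumed to be zero-based, and `:term_frequency` is assumed to be `0.0` when no document contains the term.

```elixir
defmodule TermStatisticsSketch do
  # Hypothetical helper, not part of ExNlp.
  def term_statistics(term, corpus) do
    # {index, count} pairs for documents that contain the term.
    # Assumption: indices are zero-based.
    matches =
      corpus
      |> Enum.with_index()
      |> Enum.map(fn {document, index} -> {index, Enum.count(document, &(&1 == term))} end)
      |> Enum.filter(fn {_index, count} -> count > 0 end)

    document_frequency = length(matches)
    collection_frequency = matches |> Enum.map(&elem(&1, 1)) |> Enum.sum()

    # Mean count over matching documents; 0.0 for an absent term (an assumption).
    term_frequency =
      if document_frequency == 0, do: 0.0, else: collection_frequency / document_frequency

    %{
      term_frequency: term_frequency,
      document_frequency: document_frequency,
      collection_frequency: collection_frequency,
      documents: Enum.map(matches, &elem(&1, 0))
    }
  end
end

corpus = [["cat", "dog"], ["cat", "cat"], ["dog"]]
stats = TermStatisticsSketch.term_statistics("cat", corpus)
IO.inspect(stats.term_frequency) # => 1.5  (3 occurrences / 2 documents)
IO.inspect(stats.documents)      # => [0, 1]
```

Under this reading, the average TF is simply the collection frequency divided by the document frequency.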
@spec vocabulary_size(corpus()) :: non_neg_integer()
Returns the number of unique terms in a corpus.
Examples
iex> corpus = [["cat", "dog"], ["cat", "bird"], ["dog"]]
iex> ExNlp.Statistics.vocabulary_size(corpus)
3
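Vocabulary size amounts to counting distinct tokens across the whole corpus. A minimal sketch, assuming terms are compared as exact strings with no case folding or stemming (`VocabularySketch` is a hypothetical name, not the library's code):

```elixir
defmodule VocabularySketch do
  # Hypothetical helper, not part of ExNlp.
  def vocabulary_size(corpus) do
    corpus
    # Merge all documents into one list of tokens.
    |> Enum.concat()
    # Collapse duplicates; a MapSet keeps each distinct term once.
    |> MapSet.new()
    |> MapSet.size()
  end
end

corpus = [["cat", "dog"], ["cat", "bird"], ["dog"]]
IO.inspect(VocabularySketch.vocabulary_size(corpus))
# => 3  ("cat", "dog", "bird")
```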