Newxp.SimilarityUtils (newxp v0.1.1)

Summary

Functions

jaccard_similarity(text_1_tokens, text_2_tokens, n_range)

Calculate Jaccard similarity between two token lists over n-grams up to n_range.

ngrams(tokens, n)

Extract n-grams from a list of tokens.

normalize_text(text)

Lowercase and split text into word tokens.

Functions

jaccard_similarity(text_1_tokens, text_2_tokens, n_range)

Calculate Jaccard similarity between two token lists over n-grams up to n_range.

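Jaccard(A, B) = |A ∩ B| / |A ∪ B|, where A and B are the sets of n-grams extracted from each token list.

The score ranges from 0.0 (no n-grams in common) to 1.0 (identical n-gram sets).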
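The formula above can be sketched as follows. This is an illustration only, not the library's source: the module name `JaccardSketch` is hypothetical, and it assumes that `n_range` is a positive integer, that n-grams of every size from 1 through `n_range` are pooled into a single set per input, and that two empty inputs score 0.0. The real function may differ on any of these points.

```elixir
defmodule JaccardSketch do
  # Hypothetical stand-in for jaccard_similarity/3; not the library's source.
  # Assumption: n-grams of sizes 1..n_range are pooled into one set per list.
  def jaccard_similarity(tokens_1, tokens_2, n_range)
      when is_integer(n_range) and n_range > 0 do
    set_a = ngram_set(tokens_1, n_range)
    set_b = ngram_set(tokens_2, n_range)

    union_size = MapSet.size(MapSet.union(set_a, set_b))

    if union_size == 0 do
      # Assumption: two empty inputs score 0.0 instead of dividing by zero.
      0.0
    else
      MapSet.size(MapSet.intersection(set_a, set_b)) / union_size
    end
  end

  # Collects every n-gram (as a list of tokens) for n = 1..n_range into a set.
  defp ngram_set(tokens, n_range) do
    for n <- 1..n_range,
        gram <- Enum.chunk_every(tokens, n, 1, :discard),
        into: MapSet.new(),
        do: gram
  end
end

a = ["the", "cat", "sat"]
b = ["the", "cat", "ran"]

# Each list yields 5 n-grams (3 unigrams, 2 bigrams); 3 are shared and
# 7 are distinct overall, so the score is 3/7, roughly 0.43.
JaccardSketch.jaccard_similarity(a, b, 2)
```

Pooling all sizes into one set keeps the score a single ratio; averaging a separate score per n-gram size would be an equally plausible reading of "up to n_range".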

ngrams(tokens, n)

Extract n-grams from a list of tokens.
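As a rough illustration of n-gram extraction, the sketch below slides a window of size `n` across the token list. The module name `NgramSketch` is hypothetical, and representing each n-gram as a list of tokens is an assumption; the actual function may return joined strings or tuples instead.

```elixir
defmodule NgramSketch do
  # Hypothetical stand-in for ngrams/2; not the library's source.
  # Moves a window of size n one token at a time and discards any
  # trailing window that holds fewer than n tokens.
  def ngrams(tokens, n) when is_list(tokens) and is_integer(n) and n > 0 do
    Enum.chunk_every(tokens, n, 1, :discard)
  end
end

NgramSketch.ngrams(["the", "quick", "brown", "fox"], 2)
# => [["the", "quick"], ["quick", "brown"], ["brown", "fox"]]
```

Under these assumptions, a token list shorter than `n` produces an empty list.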

normalize_text(text)

Lowercase and split text into word tokens.
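A minimal sketch of this kind of normalization is shown below. The module name `NormalizeSketch` is hypothetical, and the tokenization rule is an assumption: here a word is any run of Unicode letters or digits, so punctuation is dropped and a contraction such as "don't" splits into two tokens. The library's actual rule is not stated and may differ.

```elixir
defmodule NormalizeSketch do
  # Hypothetical stand-in for normalize_text/1; not the library's source.
  def normalize_text(text) when is_binary(text) do
    text
    |> String.downcase()
    # Assumption: split on every run of characters that are neither
    # letters nor digits, and drop the empty strings this leaves behind.
    |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
  end
end

NormalizeSketch.normalize_text("Hello, World! Hello.")
# => ["hello", "world", "hello"]
```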