Text content analyzer with support for multiple word counting algorithms emulating different word processors.
This module is the default fallback analyzer, so it will process any content block type as if it were text.
Supported Algorithms
Prosody.TextAnalyzer supports three basic algorithms for counting words, each modelled
after a different word processor. These algorithms are:
- :balanced: The default algorithm, which splits words in a way that matches human intuition. Hyphenated words (fast-paced) and alternating words (and/or) are counted as separate words. Formatted numbers (1,234) are counted as single words. This is similar to what Apple Pages does.
- :minimal: This splits words on spaces, so that "fast-paced" and "and/or" are each one word, but "and / or" is two words. This is most like Microsoft Word or LibreOffice Writer.
- :maximal: This splits words on spaces and punctuation, resulting in the highest word count.
The algorithm results are sometimes surprising, but are consistent:
| Example | :balanced | :minimal | :maximal |
|---|---|---|---|
| two words | 2 | 2 | 2 |
| and/or | 2 | 1 | 2 |
| and / or | 2 | 2 | 2 |
| fast-paced | 2 | 1 | 2 |
| 1,234.56 | 1 | 1 | 3 |
| www.example.com | 1 | 1 | 3 |
| bob@example.com | 1 | 1 | 3 |
A longer example, using the sentence:
The CEO's Q3 buy/sell analysis shows revenue increased 23.8% year-over-year, reaching $4.2M through our e-commerce platform at shop.company.co.uk. Email investors@company.com for the full profit/loss report.
- :balanced produces 30 words
- :minimal produces 25 words
- :maximal produces 37 words
Contractions are always preserved as single words for all algorithms.
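The counts above can be approximated with a small sketch built only on Elixir's standard library. The module name WordSplitSketch and its three separator patterns are assumptions made for illustration, not Prosody's actual implementation. In particular, the sketch does no URL, email, or number detection; it relies on those tokens not containing one of the chosen separators.

```elixir
defmodule WordSplitSketch do
  # Hypothetical separator patterns, one per algorithm.
  # :minimal splits on whitespace only.
  defp separators(:minimal), do: ~r/\s+/u
  # :balanced additionally splits on slashes and hyphens.
  defp separators(:balanced), do: ~r{[\s/-]+}u
  # :maximal splits on everything except letters, digits, and
  # apostrophes, so contractions such as "CEO's" stay whole.
  defp separators(:maximal), do: ~r/[^\p{L}\p{N}']+/u

  # Counts words in `text`, skipping tokens that are only punctuation.
  def count(text, algorithm \\ :balanced) do
    text
    |> String.split(separators(algorithm), trim: true)
    |> Enum.reject(&punctuation_only?/1)
    |> Enum.count()
  end

  # A token with no letter or digit is treated as a punctuation "word".
  defp punctuation_only?(token) do
    not String.match?(token, ~r/[\p{L}\p{N}]/u)
  end
end

IO.inspect(WordSplitSketch.count("and / or", :minimal))   # 2
IO.inspect(WordSplitSketch.count("1,234.56", :maximal))   # 3
IO.inspect(WordSplitSketch.count("fast-paced"))           # 2
```

Under these assumed patterns the sketch agrees with every row of the table and with the 30, 25, and 37 word counts for the longer sentence. A URL containing a path would be split by the :balanced pattern here, which is one reason a real implementation needs explicit URL handling.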
Options
Behaviour may be changed by providing configuration options to analyze/2.
- :algorithm: The counting algorithm to use. If provided, must be one of :balanced, :minimal, or :maximal.
Explicit feature configuration may be provided with specific options:
- :preserve_urls: Whether to count URLs as single words
- :preserve_emails: Whether to count emails as single words
- :preserve_numbers: Whether to count numbers as single words
- :skip_punctuation_words: Whether "words" that are just punctuation are skipped or counted
- :word_separators: Either a list of characters forming a String.pattern/0, or a regular expression, indicating how words should be separated. This may not be specified if :algorithm is specified.
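To illustrate the two shapes that :word_separators accepts, the following sketch uses String.split/3 from the standard library, which takes either a list of strings or a regular expression as its pattern. Whether Prosody passes the option to String.split/3 internally is an assumption; the separators chosen here are only examples.

```elixir
text = "fast-paced and/or"

# A list of separator strings is a valid String.pattern/0.
by_list = String.split(text, [" ", "-", "/"], trim: true)

# A regular expression covering the same separators.
by_regex = String.split(text, ~r{[\s/-]+}, trim: true)

IO.inspect(by_list)   # ["fast", "paced", "and", "or"]
IO.inspect(by_regex)  # ["fast", "paced", "and", "or"]
```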
The different algorithms provide different defaults beyond their word_separators.
- :balanced and :minimal preserve URLs, email addresses, and numbers, and skip punctuation "words".
- :maximal skips punctuation words but does not preserve URLs, email addresses, or numbers by default.
It is permissible to specify algorithm: :maximal, preserve_urls: true, where the
maximal approach will be taken, but URLs will be counted as a single word.
If neither :algorithm nor :word_separators is provided, then algorithm: :balanced is
used.
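The option rules described in this section can be summarised in a short sketch. OptionsSketch and resolve/1 are hypothetical names, and the error tuples are invented for illustration; this shows one way the documented defaults and the :algorithm / :word_separators exclusion could be resolved, not how Prosody does it.

```elixir
defmodule OptionsSketch do
  # Feature defaults per algorithm, as described in the text above.
  @preserving [
    preserve_urls: true,
    preserve_emails: true,
    preserve_numbers: true,
    skip_punctuation_words: true
  ]

  @splitting [
    preserve_urls: false,
    preserve_emails: false,
    preserve_numbers: false,
    skip_punctuation_words: true
  ]

  @defaults %{balanced: @preserving, minimal: @preserving, maximal: @splitting}

  # Hypothetical helper: returns {:ok, resolved_opts} or {:error, reason}.
  def resolve(opts \\ []) do
    algorithm? = Keyword.has_key?(opts, :algorithm)
    separators? = Keyword.has_key?(opts, :word_separators)

    cond do
      # The two options are documented as mutually exclusive.
      algorithm? and separators? ->
        {:error, :algorithm_and_word_separators}

      # The source does not say which feature defaults apply to custom
      # separators, so the options are passed through unchanged.
      separators? ->
        {:ok, opts}

      # With neither option given, :balanced is the documented default.
      true ->
        with_defaults(Keyword.get(opts, :algorithm, :balanced), opts)
    end
  end

  defp with_defaults(algorithm, opts) do
    case Map.fetch(@defaults, algorithm) do
      {:ok, defaults} ->
        # Explicit options override the algorithm's defaults.
        resolved = defaults |> Keyword.merge(opts) |> Keyword.put(:algorithm, algorithm)
        {:ok, resolved}

      :error ->
        {:error, {:unknown_algorithm, algorithm}}
    end
  end
end

{:ok, opts} = OptionsSketch.resolve(algorithm: :maximal, preserve_urls: true)
IO.inspect(Keyword.get(opts, :preserve_urls))    # true
IO.inspect(Keyword.get(opts, :preserve_emails))  # false
```

Merging explicit options over the defaults is what lets a combination such as algorithm: :maximal, preserve_urls: true keep the maximal behaviour while overriding a single feature.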