Text.Sentiment.Lexicon (Text v0.5.0)


Lexicon-based sentiment scoring.

Scores a piece of text by looking up each token in a polarity lexicon (a map from String.t/0 to a numeric score), summing the matched scores, and optionally adjusting for nearby negators and intensifiers.

This module is the deterministic engine underneath Text.Sentiment. Most callers use the higher-level facade; this one is exposed for callers who want to plug in a custom lexicon (an industry-specific vocabulary, a non-AFINN translation, an emoji lexicon, etc.).

Score semantics

The default English lexicon (AFINN-165) uses integer scores in -5..+5, with 0 reserved for neutral terms. Sums of these are unbounded; the engine returns the raw sum plus a normalised compound score in [-1.0, +1.0] derived via the formula

compound = sum / sqrt(sum² + α)

with α = 15 (matching VADER's normalisation). This tames the range without saturating too quickly: a sum of 5 yields about +0.79, a sum of 15 yields about +0.97, and a sum of 0 yields exactly 0.0.
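The normalisation can be written out directly. A minimal sketch (α = 15, as above; this is the formula from the text, not the library's internal code):

```elixir
# compound = sum / sqrt(sum² + α), with α = 15
compound = fn sum -> sum / :math.sqrt(sum * sum + 15) end

compound.(5)   # ≈ 0.7906
compound.(15)  # ≈ 0.9682
compound.(0)   # 0.0
```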

Negation and intensifier handling

Two simple, well-understood adjustments are applied during scoring:

  • Negation: when one of the configured negation tokens ("not", "never", "no", etc.) appears in the :negation_window tokens immediately preceding a polarity-bearing token, that token's score is multiplied by -0.74 (the VADER scalar). This is a deliberate partial flip: it captures the intuition that negation usually reverses polarity, but rarely with full magnitude.

  • Intensifiers: when one of the configured intensifier tokens ("very", "extremely", etc.) immediately precedes a polarity-bearing token, that token's score is multiplied by 1.293. Diminishers ("slightly", "barely") similarly multiply by 0.707. Both scalars come from VADER and are tunable via :intensifier_boost and :diminisher_factor.

These rules are deliberately limited — they don't handle multi-word negation, sarcasm, or domain-specific reversals. For higher-quality multilingual sentiment, see the planned Bumblebee-backed adapter.
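The two rules above can be sketched in a few lines. This is an illustration of the described behaviour, not the module's internals; the scalars are the documented defaults, and the order (intensify first, then negate) is an assumption:

```elixir
# Hedged sketch of the negation/intensifier adjustment.
# `preceding` lists the tokens before the polarity-bearing token, nearest first.
negators = ~w(not never no)
intensifiers = ~w(very extremely)

adjust = fn score, preceding ->
  # Intensifier must be the immediately preceding token (1.293 boost).
  score = if Enum.at(preceding, 0) in intensifiers, do: score * 1.293, else: score
  # Negator may appear anywhere in the (default 3-token) window (-0.74 flip).
  window = Enum.take(preceding, 3)
  if Enum.any?(window, &(&1 in negators)), do: score * -0.74, else: score
end

# "not very good": the tokens before "good" are ["very", "not"].
adjust.(3, ["very", "not"])  # 3 * 1.293 * -0.74 ≈ -2.87
```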

Summary

Types

A polarity lexicon: token → numeric score.

The structured result returned by score/3.

Functions

Scores text against lexicon.

Types

lexicon()

@type lexicon() :: %{required(String.t()) => number()}

A polarity lexicon: token → numeric score.

result()

@type result() :: %{
  sum: float(),
  compound: float(),
  label: :positive | :negative | :neutral,
  tokens: non_neg_integer(),
  matched: non_neg_integer()
}

The structured result returned by score/3.

Functions

score(text, lexicon, options \\ [])

@spec score(String.t(), lexicon(), keyword()) :: result()

Scores text against lexicon.

Arguments

  • text is a UTF-8 string.

  • lexicon is a map from token to numeric score. Tokens are matched after the same case-folding the engine applies to text (lowercase by default; see :fold_case).

Options

  • :tokenizer — a one-arg function from string to token list. Defaults to &Text.Segment.words/1.

  • :fold_case — true (default) lowercases tokens before lookup. Set false if your lexicon is case-sensitive.

  • :negators — a list of tokens that, when seen in the :negation_window tokens preceding a polarity-bearing token, flip its score. Defaults to a small set of English negators ("not", "never", "no", "none", "nobody", "nor", "neither", "cannot", "can't", "don't", "isn't", "won't", "wasn't").

  • :intensifiers — a list of tokens that, when immediately preceding a polarity-bearing token, boost its score. Defaults to a small set of English intensifiers.

  • :diminishers — a list of tokens that, when immediately preceding a polarity-bearing token, dampen its score. Defaults to a small set of English diminishers.

  • :negation_window — how many preceding tokens to scan for a negator. Defaults to 3.

  • :negation_scalar — multiplier applied when a negator is found. Defaults to -0.74.

  • :intensifier_boost — multiplier applied when an intensifier is found. Defaults to 1.293.

  • :diminisher_factor — multiplier applied when a diminisher is found. Defaults to 0.707.

  • :positive_threshold, :negative_threshold — compound-score cutoffs for the :label field. Defaults to 0.05 and -0.05.
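Several of these options combine naturally in one call. A hedged sketch using the option names documented above (the lexicon and text here are made up for illustration):

```elixir
# Hypothetical domain lexicon; full flip on negation within a 2-token window.
lexicon = %{"bullish" => 3, "bearish" => -3}

Text.Sentiment.Lexicon.score("not bearish at all", lexicon,
  negation_window: 2,
  negation_scalar: -1.0,
  positive_threshold: 0.1
)
# With these options, "bearish" (-3) is fully flipped to +3 by "not",
# so the compound score clears the 0.1 positive threshold.
```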

Returns

A result/0 map with:

  • :sum — the raw sum of matched (and adjusted) lexicon scores.

  • :compound — the normalised score in [-1.0, +1.0].

  • :label — :positive, :negative, or :neutral based on the threshold cutoffs.

  • :tokens — total token count after tokenisation.

  • :matched — number of tokens that hit the lexicon.

Examples

iex> lexicon = %{"good" => 3, "bad" => -3, "great" => 4}
iex> result = Text.Sentiment.Lexicon.score("This is a good day", lexicon)
iex> result.label
:positive

iex> lexicon = %{"good" => 3, "bad" => -3}
iex> result = Text.Sentiment.Lexicon.score("not a bad outcome", lexicon)
iex> result.label
:positive
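A further example that follows directly from the score semantics above (hedged, in the same doctest style): when no token hits the lexicon, the sum is 0, the compound is exactly 0.0, and the label falls between the default ±0.05 thresholds:

```elixir
iex> lexicon = %{"good" => 3, "bad" => -3}
iex> result = Text.Sentiment.Lexicon.score("the meeting happened", lexicon)
iex> {result.label, result.matched}
{:neutral, 0}
```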