Nasty.Operations.Summarization.Extractive behaviour (Nasty v0.3.0)

Language-agnostic extractive summarization algorithms.

Provides generic scoring and selection methods that work with any AST structure. Language-specific implementations supply configuration such as stop words, discourse markers, and an entity recognizer.

Usage

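defmodule MyLanguage.Summarizer do
  use Nasty.Operations.Summarization.Extractive

  @impl true
  def stop_words, do: ["a", "the", "is"]

  @impl true
  def discourse_markers, do: ["therefore", "conclusion"]

  # Optional callback; may return nil when no recognizer is available.
  @impl true
  def entity_recognizer, do: MyLanguage.EntityRecognizer

  # Required callback. MyLanguage.Tokenizer.tokens/1 is a hypothetical helper;
  # substitute whatever walks your sentence AST and returns its tokens.
  @impl true
  def extract_tokens(sentence), do: MyLanguage.Tokenizer.tokens(sentence)
end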

Summary

Callbacks

discourse_markers()

Callback for providing discourse markers.

entity_recognizer()

Callback for the entity recognition module (optional).

extract_tokens(t)

Callback for extracting tokens from a sentence. Must be implemented by the language-specific module.

stop_words()

Callback for providing stop words for keyword scoring.

Functions

calculate_max_similarity(sent, selected)

Calculates maximum similarity between a sentence and selected sentences.

calculate_target_count(sentences, max_sentences, ratio)

Calculates target number of sentences for summary.

coreference_score(sentence, position, coref_chains)

Coreference score: sentences participating in coref chains are important.

discourse_marker_score(impl, sentence)

Discourse marker score: signal words indicate importance.

entity_score(impl, sentence)

Entity score: sentences with named entities are more important.

extract_sentences(paragraphs)

Extracts all sentences from paragraphs.

jaccard_similarity(set1, set2)

Calculates Jaccard similarity between two term sets.

keyword_score(impl, sentence, all_sentences)

Keyword score based on term frequency.

length_score(impl, sentence)

Length score: prefer moderate-length sentences.

position_score(position, total)

Position score: earlier sentences are more important.

score_all_sentences(impl, sentences, coref_chains, opts)

Scores all sentences in a document.

score_sentence(impl, sentence, position, all_sentences, coref_chains, opts)

Scores a single sentence using multiple heuristics.

select_greedy(scored_sentences, count)

Greedy selection: pick top-N by score.

select_mmr(impl, scored_sentences, count, opts)

MMR selection: maximize relevance while minimizing redundancy.

summarize(impl, document, opts \\ [])

Summarizes a document using extractive methods.

Callbacks

discourse_markers()

@callback discourse_markers() :: [String.t()]

Callback for providing discourse markers.

entity_recognizer()

(optional)

@callback entity_recognizer() :: module() | nil

Callback for the entity recognition module (optional).

extract_tokens(t)

@callback extract_tokens(Nasty.AST.Sentence.t()) :: [term()]

Callback for extracting tokens from a sentence. Must be implemented by the language-specific module.

stop_words()

@callback stop_words() :: [String.t()]

Callback for providing stop words for keyword scoring.

Functions

calculate_max_similarity(sent, selected)

@spec calculate_max_similarity(Nasty.AST.Sentence.t(), [
  {Nasty.AST.Sentence.t(), integer(), float()}
]) ::
  float()

Calculates maximum similarity between a sentence and selected sentences.
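
As an illustration only, the idea can be sketched with plain token lists standing in for Nasty.AST.Sentence structs. The module name MaxSimilaritySketch is hypothetical, and the use of Jaccard overlap as the similarity measure is an assumption based on the neighbouring jaccard_similarity/2 function, not something this page confirms.

```elixir
defmodule MaxSimilaritySketch do
  # Hypothetical stand-in: sentences are plain token lists here, and the
  # selected list holds {tokens, position, score} tuples.
  # Assumption: similarity is Jaccard overlap of the token sets.
  def max_similarity(_tokens, []), do: 0.0

  def max_similarity(tokens, selected) do
    terms = MapSet.new(tokens)

    selected
    |> Enum.map(fn {other_tokens, _position, _score} ->
      jaccard(terms, MapSet.new(other_tokens))
    end)
    |> Enum.max()
  end

  defp jaccard(a, b) do
    union = MapSet.size(MapSet.union(a, b))
    if union == 0, do: 0.0, else: MapSet.size(MapSet.intersection(a, b)) / union
  end
end
```

With nothing selected yet, the sketch returns 0.0, so the first candidate is never penalised for redundancy.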

calculate_target_count(sentences, max_sentences, ratio)

@spec calculate_target_count([Nasty.AST.Sentence.t()], integer() | nil, float()) ::
  integer()

Calculates target number of sentences for summary.
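
One plausible way to combine the ratio and the optional cap is sketched below. The rounding rule (round up), the minimum of one sentence for non-empty input, and the module name TargetCountSketch are all assumptions made for illustration; the actual implementation may differ.

```elixir
defmodule TargetCountSketch do
  # Assumptions: the ratio-based count is rounded up, at least one sentence
  # is kept for non-empty input, and max_sentences (when not nil) caps it.
  def target_count([], _max_sentences, _ratio), do: 0

  def target_count(sentences, max_sentences, ratio) do
    by_ratio = max(1, ceil(length(sentences) * ratio))

    case max_sentences do
      nil -> by_ratio
      cap when is_integer(cap) -> min(by_ratio, cap)
    end
  end
end
```

Under these assumptions, ten sentences at a ratio of 0.5 yield five sentences, or three when max_sentences is 3.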

coreference_score(sentence, position, coref_chains)

@spec coreference_score(Nasty.AST.Sentence.t(), integer(), [term()]) :: float()

Coreference score: sentences participating in coref chains are important.

discourse_marker_score(impl, sentence)

@spec discourse_marker_score(module(), Nasty.AST.Sentence.t()) :: float()

Discourse marker score: signal words indicate importance.
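
A minimal sketch of this heuristic, assuming a binary score and case-insensitive matching on string tokens. The module name DiscourseMarkerSketch is hypothetical, and the real function receives the implementing module and a Nasty.AST.Sentence rather than raw lists.

```elixir
defmodule DiscourseMarkerSketch do
  # Assumption: the score is binary, 1.0 when any token (lowercased)
  # appears in the marker list and 0.0 otherwise.
  def score(tokens, markers) do
    marker_set = MapSet.new(markers)

    if Enum.any?(tokens, &MapSet.member?(marker_set, String.downcase(&1))) do
      1.0
    else
      0.0
    end
  end
end
```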

entity_score(impl, sentence)

@spec entity_score(module(), Nasty.AST.Sentence.t()) :: float()

Entity score: sentences with named entities are more important.

extract_sentences(paragraphs)

@spec extract_sentences([Nasty.AST.Paragraph.t()]) :: [Nasty.AST.Sentence.t()]

Extracts all sentences from paragraphs.

jaccard_similarity(set1, set2)

@spec jaccard_similarity(MapSet.t(), MapSet.t()) :: float()

Calculates Jaccard similarity between two term sets.
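
Jaccard similarity is the size of the intersection divided by the size of the union. The sketch below shows the calculation on MapSet values; the module name JaccardSketch is hypothetical, and returning 0.0 for two empty sets is an assumed convention to avoid dividing by zero.

```elixir
defmodule JaccardSketch do
  # Jaccard similarity: |A ∩ B| / |A ∪ B|.
  # Assumption: two empty sets get 0.0 rather than raising on division by zero.
  def similarity(set1, set2) do
    union_size = set1 |> MapSet.union(set2) |> MapSet.size()

    if union_size == 0 do
      0.0
    else
      intersection_size = set1 |> MapSet.intersection(set2) |> MapSet.size()
      intersection_size / union_size
    end
  end
end

a = MapSet.new(["cat", "sat", "mat"])
b = MapSet.new(["cat", "sat", "hat"])
# Two shared terms out of four distinct terms gives 0.5.
JaccardSketch.similarity(a, b)
```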

keyword_score(impl, sentence, all_sentences)

@spec keyword_score(module(), Nasty.AST.Sentence.t(), [Nasty.AST.Sentence.t()]) ::
  float()

Keyword score based on term frequency.
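
The page does not spell out the formula, so the following is only one way a term-frequency score could work. It assumes sentences are lists of lowercase token strings, that stop words are ignored, and that the score is the mean document frequency of a sentence's remaining tokens normalised by the highest frequency. KeywordScoreSketch is a hypothetical name.

```elixir
defmodule KeywordScoreSketch do
  # Stand-in sentences are lists of lowercase token strings.
  # Assumption: score = mean document frequency of the sentence's
  # non-stop-word tokens, divided by the highest frequency in the document.
  def keyword_score(tokens, all_sentences, stop_words) do
    stop = MapSet.new(stop_words)
    keep = fn toks -> Enum.reject(toks, &MapSet.member?(stop, &1)) end

    freqs = all_sentences |> Enum.flat_map(keep) |> Enum.frequencies()
    content = keep.(tokens)

    if content == [] or map_size(freqs) == 0 do
      0.0
    else
      max_freq = freqs |> Map.values() |> Enum.max()
      total = content |> Enum.map(&Map.get(freqs, &1, 0)) |> Enum.sum()
      total / (length(content) * max_freq)
    end
  end
end
```

Filtering stop words before counting keeps very common function words from dominating the frequency table.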

length_score(impl, sentence)

@spec length_score(module(), Nasty.AST.Sentence.t()) :: float()

Length score: prefer moderate-length sentences.

position_score(position, total)

@spec position_score(integer(), integer()) :: float()

Position score: earlier sentences are more important.
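
A simple way to express "earlier is better" is a linear decay, sketched here. Zero-based positions, the linear shape, and the module name PositionScoreSketch are assumptions; the library may use a different curve.

```elixir
defmodule PositionScoreSketch do
  # Assumption: positions are zero-based and the score decays linearly
  # from 1.0 for the first sentence towards 0.0 for later ones.
  def position_score(_position, total) when total <= 1, do: 1.0
  def position_score(position, total), do: 1.0 - position / total
end
```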

score_all_sentences(impl, sentences, coref_chains, opts)

@spec score_all_sentences(module(), [Nasty.AST.Sentence.t()], [term()], keyword()) ::
  [
    {Nasty.AST.Sentence.t(), integer(), float()}
  ]

Scores all sentences in a document.

score_sentence(impl, sentence, position, all_sentences, coref_chains, opts)

@spec score_sentence(
  module(),
  Nasty.AST.Sentence.t(),
  integer(),
  [Nasty.AST.Sentence.t()],
  [term()],
  keyword()
) :: float()

Scores a single sentence using multiple heuristics.

Default weights

Position: 0.25
Length: 0.15
Entity: 0.25
Keyword: 0.15
Discourse: 0.10
Coreference: 0.10
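
The weights listed above sum to 1.0, which suggests a weighted sum of the individual heuristic scores. The sketch below shows that combination under that assumption; the map keys, the merge of custom weights over the defaults, and the module name WeightedScoreSketch are illustrative choices rather than the library's confirmed behaviour.

```elixir
defmodule WeightedScoreSketch do
  # Default weights as listed above; the map keys are assumed names.
  @default_weights %{
    position: 0.25,
    length: 0.15,
    entity: 0.25,
    keyword: 0.15,
    discourse: 0.10,
    coreference: 0.10
  }

  # component_scores holds one score per heuristic; missing ones count as 0.0.
  # custom_weights override the defaults key by key.
  def combine(component_scores, custom_weights \\ %{}) do
    weights = Map.merge(@default_weights, custom_weights)

    Enum.reduce(weights, 0.0, fn {name, weight}, acc ->
      acc + weight * Map.get(component_scores, name, 0.0)
    end)
  end
end
```

If every heuristic returns a value between 0.0 and 1.0 and the weights sum to 1.0, the combined score stays in the same range.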

select_greedy(scored_sentences, count)

@spec select_greedy([{Nasty.AST.Sentence.t(), integer(), float()}], integer()) :: [
  {Nasty.AST.Sentence.t(), integer(), float()}
]

Greedy selection: pick top-N by score.
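
Top-N selection over {sentence, position, score} tuples can be sketched in a few lines. GreedySketch is a hypothetical module name, and strings stand in for sentence structs in the usage example.

```elixir
defmodule GreedySketch do
  # Keeps the `count` tuples with the highest scores, best first.
  def select(scored, count) do
    scored
    |> Enum.sort_by(fn {_sentence, _position, score} -> score end, :desc)
    |> Enum.take(count)
  end
end

scored = [{"a", 0, 0.2}, {"b", 1, 0.9}, {"c", 2, 0.5}]
# Keeps "b" and "c", the two highest-scoring entries.
GreedySketch.select(scored, 2)
```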

select_mmr(impl, scored_sentences, count, opts)

@spec select_mmr(
  module(),
  [{Nasty.AST.Sentence.t(), integer(), float()}],
  integer(),
  keyword()
) :: [
  {Nasty.AST.Sentence.t(), integer(), float()}
]

MMR selection: maximize relevance while minimizing redundancy.
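
Maximal Marginal Relevance picks sentences one at a time, trading each candidate's own score against its similarity to what has already been picked. The sketch below assumes the common formulation lambda * score - (1 - lambda) * max_similarity, Jaccard overlap as the similarity measure, and token lists in place of sentence structs. MMRSketch is a hypothetical name, and the library's exact formula is not stated on this page.

```elixir
defmodule MMRSketch do
  # Stand-in sentences are token lists inside {tokens, position, score} tuples.
  # Assumption: mmr = lambda * score - (1 - lambda) * max Jaccard similarity
  # to the sentences selected so far.
  def select(scored, count, lambda \\ 0.5), do: pick(scored, [], count, lambda)

  defp pick([], selected, _count, _lambda), do: Enum.reverse(selected)

  defp pick(_rest, selected, count, _lambda) when length(selected) >= count,
    do: Enum.reverse(selected)

  defp pick(candidates, selected, count, lambda) do
    best =
      Enum.max_by(candidates, fn {tokens, _pos, score} ->
        lambda * score - (1 - lambda) * max_similarity(tokens, selected)
      end)

    pick(List.delete(candidates, best), [best | selected], count, lambda)
  end

  defp max_similarity(_tokens, []), do: 0.0

  defp max_similarity(tokens, selected) do
    terms = MapSet.new(tokens)

    selected
    |> Enum.map(fn {other, _pos, _score} -> jaccard(terms, MapSet.new(other)) end)
    |> Enum.max()
  end

  defp jaccard(a, b) do
    union = MapSet.size(MapSet.union(a, b))
    if union == 0, do: 0.0, else: MapSet.size(MapSet.intersection(a, b)) / union
  end
end
```

In this formulation a lambda of 1.0 reduces to picking by score alone, while lower values penalise near-duplicates more heavily.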

summarize(impl, document, opts \\ [])

@spec summarize(module(), Nasty.AST.Document.t(), keyword()) :: [
  Nasty.AST.Sentence.t()
]

Summarizes a document using extractive methods.

Options

:ratio - Compression ratio, 0.0 to 1.0 (default: 0.3)
:max_sentences - Maximum number of sentences in summary
:min_sentence_length - Minimum sentence length (in tokens)
:method - Selection method: :greedy or :mmr (default: :greedy)
:mmr_lambda - MMR diversity parameter, 0.0 to 1.0 (default: 0.5)
:score_weights - Custom weights for scoring components (map)

Returns a list of selected sentences in document order.
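
Because selection works on {sentence, position, score} tuples and may pick them in score order, returning sentences in document order implies a final re-sort by position. A minimal sketch of that last step, with a hypothetical module name and strings standing in for sentence structs:

```elixir
defmodule DocumentOrderSketch do
  # Re-sorts selected {sentence, position, score} tuples by position and
  # drops the bookkeeping fields, leaving sentences in document order.
  def restore_order(selected) do
    selected
    |> Enum.sort_by(fn {_sentence, position, _score} -> position end)
    |> Enum.map(fn {sentence, _position, _score} -> sentence end)
  end
end

selected = [{"Third.", 2, 0.9}, {"First.", 0, 0.5}]
# The lower-scoring but earlier sentence comes first in the result.
DocumentOrderSketch.restore_order(selected)
```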