Nasty.Operations.Summarization.Extractive behaviour (Nasty v0.3.0)
Language-agnostic extractive summarization algorithms.
Provides generic scoring and selection methods that work with any AST structure. Language-specific implementations provide configuration like stop words, discourse markers, and entity recognition.
Usage
    defmodule MyLanguage.Summarizer do
      use Nasty.Operations.Summarization.Extractive

      @impl true
      def stop_words, do: ["a", "the", "is"]

      @impl true
      def discourse_markers, do: ["therefore", "conclusion"]

      @impl true
      def entity_recognizer, do: MyLanguage.EntityRecognizer

      # extract_tokens/1 must also be implemented for the language's AST.
    end
Summary
Callbacks
- discourse_markers/0: Callback for providing discourse markers.
- entity_recognizer/0: Callback for entity recognition module (optional).
- extract_tokens/1: Callback for extracting tokens from a sentence. Must be implemented by the language-specific module.
- stop_words/0: Callback for providing stop words for keyword scoring.
Functions
- calculate_max_similarity/2: Calculates maximum similarity between a sentence and selected sentences.
- calculate_target_count/3: Calculates target number of sentences for summary.
- coreference_score/3: Coreference score: sentences participating in coreference chains are important.
- discourse_marker_score/2: Discourse marker score: signal words indicate importance.
- entity_score/2: Entity score: sentences with named entities are more important.
- extract_sentences/1: Extracts all sentences from paragraphs.
- Calculates Jaccard similarity between two term sets.
- keyword_score/3: Keyword score based on term frequency.
- length_score/2: Length score: prefer moderate-length sentences.
- Position score: earlier sentences are more important.
- score_all_sentences/4: Scores all sentences in a document.
- score_sentence/6: Scores a single sentence using multiple heuristics.
- select_greedy/2: Greedy selection: pick top-N by score.
- select_mmr/4: MMR selection: maximize relevance while minimizing redundancy.
- summarize/3: Summarizes a document using extractive methods.
Callbacks
@callback discourse_markers() :: [String.t()]
Callback for providing discourse markers.
@callback entity_recognizer() :: module() | nil
Callback for entity recognition module (optional).
@callback extract_tokens(Nasty.AST.Sentence.t()) :: [term()]
Callback for extracting tokens from a sentence. Must be implemented by the language-specific module.
@callback stop_words() :: [String.t()]
Callback for providing stop words for keyword scoring.
Functions
@spec calculate_max_similarity(Nasty.AST.Sentence.t(), [{Nasty.AST.Sentence.t(), integer(), float()}]) :: float()
Calculates maximum similarity between a sentence and selected sentences.
@spec calculate_target_count([Nasty.AST.Sentence.t()], integer() | nil, float()) :: integer()
Calculates target number of sentences for summary.
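As a rough illustration of how a target count could be derived from the `:ratio` and `:max_sentences` options, here is a minimal sketch. The module name `TargetCountSketch`, the argument order, and the rounding and lower-bound rules are all assumptions made for this example; the library's actual calculation may differ.

```elixir
defmodule TargetCountSketch do
  # Hypothetical helper, not part of Nasty.
  # An empty document yields an empty summary.
  def target_count([], _max_sentences, _ratio), do: 0

  def target_count(sentences, max_sentences, ratio) do
    # Scale the sentence count by the ratio, keeping at least one sentence
    # (the lower bound of 1 is an assumption of this sketch).
    by_ratio = max(1, round(length(sentences) * ratio))

    # Apply the optional cap when one is given.
    case max_sentences do
      nil -> by_ratio
      cap when is_integer(cap) -> min(by_ratio, cap)
    end
  end
end

TargetCountSketch.target_count(Enum.to_list(1..10), nil, 0.3)
TargetCountSketch.target_count(Enum.to_list(1..10), 2, 0.3)
```

With ten sentences and a ratio of 0.3, the first call yields 3 and the second is capped at 2.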
@spec coreference_score(Nasty.AST.Sentence.t(), integer(), [term()]) :: float()
Coreference score: sentences participating in coreference chains are important.
@spec discourse_marker_score(module(), Nasty.AST.Sentence.t()) :: float()
Discourse marker score: signal words indicate importance.
@spec entity_score(module(), Nasty.AST.Sentence.t()) :: float()
Entity score: sentences with named entities are more important.
@spec extract_sentences([Nasty.AST.Paragraph.t()]) :: [Nasty.AST.Sentence.t()]
Extracts all sentences from paragraphs.
Calculates Jaccard similarity between two term sets.
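Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below shows the standard computation with `MapSet`; the module name `JaccardSketch` and the choice to return 0.0 for two empty sets are assumptions of this example, not the library's documented behaviour.

```elixir
defmodule JaccardSketch do
  # Hypothetical helper, not part of Nasty.
  # Computes |A intersect B| / |A union B| over two term lists.
  def similarity(terms_a, terms_b) do
    a = MapSet.new(terms_a)
    b = MapSet.new(terms_b)
    union_size = MapSet.size(MapSet.union(a, b))

    if union_size == 0 do
      # Both sets empty: treated as no overlap (an assumption).
      0.0
    else
      MapSet.size(MapSet.intersection(a, b)) / union_size
    end
  end
end

JaccardSketch.similarity(["cat", "sat", "mat"], ["cat", "mat", "hat"])
```

The two lists share two of four distinct terms, so the call evaluates to 0.5.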
@spec keyword_score(module(), Nasty.AST.Sentence.t(), [Nasty.AST.Sentence.t()]) :: float()
Keyword score based on term frequency.
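One plausible reading of a term-frequency keyword score is sketched below: count how often each non-stop-word term occurs across the document, then average those counts over the terms of one sentence. The module name `KeywordScoreSketch`, the use of plain token lists in place of `Nasty.AST.Sentence` structs, and the normalization by the maximum frequency are assumptions made for illustration.

```elixir
defmodule KeywordScoreSketch do
  # Hypothetical helper, not part of Nasty.
  # Sentences are plain lists of lowercase token strings in this sketch.
  def score(sentence_tokens, all_sentences, stop_words) do
    stop = MapSet.new(stop_words)
    content = fn tokens -> Enum.reject(tokens, &MapSet.member?(stop, &1)) end

    # Document-wide frequency of every non-stop-word term.
    freqs = all_sentences |> Enum.flat_map(content) |> Enum.frequencies()
    terms = content.(sentence_tokens)

    if terms == [] or map_size(freqs) == 0 do
      0.0
    else
      max_freq = freqs |> Map.values() |> Enum.max()
      total = terms |> Enum.map(&Map.get(freqs, &1, 0)) |> Enum.sum()
      # Mean frequency of the sentence's terms, scaled into 0.0..1.0.
      total / (length(terms) * max_freq)
    end
  end
end

doc = [["the", "cat", "sat"], ["the", "cat", "ran"], ["a", "dog", "barked"]]
KeywordScoreSketch.score(hd(doc), doc, ["a", "the"])
```

Here "cat" occurs twice and every other content term once, so the first sentence scores (2 + 1) / (2 * 2) = 0.75.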
@spec length_score(module(), Nasty.AST.Sentence.t()) :: float()
Length score: prefer moderate-length sentences.
Position score: earlier sentences are more important.
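The source does not state how the position score decays, so the following is only a sketch of one simple choice: a linear drop from 1.0 for the first sentence. The module name `PositionScoreSketch`, the zero-based index, and the linear shape are assumptions.

```elixir
defmodule PositionScoreSketch do
  # Hypothetical helper, not part of Nasty.
  # Linear decay: index 0 scores 1.0 and later sentences score less.
  def score(_index, total) when total <= 0, do: 0.0
  def score(index, total), do: 1.0 - index / total
end

PositionScoreSketch.score(0, 4)
PositionScoreSketch.score(2, 4)
```

In a four-sentence document the first sentence scores 1.0 and the third scores 0.5.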
@spec score_all_sentences(module(), [Nasty.AST.Sentence.t()], [term()], keyword()) :: [{Nasty.AST.Sentence.t(), integer(), float()}]
Scores all sentences in a document.
@spec score_sentence(module(), Nasty.AST.Sentence.t(), integer(), [Nasty.AST.Sentence.t()], [term()], keyword()) :: float()
Scores a single sentence using multiple heuristics.
Default weights
- Position: 0.25
- Length: 0.15
- Entity: 0.25
- Keyword: 0.15
- Discourse: 0.10
- Coreference: 0.10
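The default weights sum to 1.0, which suggests the final score is a weighted sum of the individual heuristic scores. The sketch below shows that combination and how custom weights might override the defaults; the module name `WeightedScoreSketch` and the map keys (`:position`, `:length`, and so on) are hypothetical, since the source does not show the keys expected by `:score_weights`.

```elixir
defmodule WeightedScoreSketch do
  # Hypothetical helper, not part of Nasty.
  # Default weights as documented; the key names are assumptions.
  @default_weights %{
    position: 0.25,
    length: 0.15,
    entity: 0.25,
    keyword: 0.15,
    discourse: 0.10,
    coreference: 0.10
  }

  # Merges caller overrides into the defaults, then takes the weighted sum
  # of the component scores (each expected in 0.0..1.0).
  def combine(component_scores, overrides \\ %{}) do
    weights = Map.merge(@default_weights, overrides)

    Enum.reduce(weights, 0.0, fn {name, weight}, acc ->
      acc + weight * Map.get(component_scores, name, 0.0)
    end)
  end
end

WeightedScoreSketch.combine(%{position: 1.0})
WeightedScoreSketch.combine(%{position: 1.0}, %{position: 0.5})
```

A sentence that scores 1.0 on position and 0.0 elsewhere gets 0.25 under the defaults, or 0.5 when the position weight is overridden to 0.5.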
@spec select_greedy([{Nasty.AST.Sentence.t(), integer(), float()}], integer()) :: [{Nasty.AST.Sentence.t(), integer(), float()}]
Greedy selection: pick top-N by score.
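Top-N selection over `{sentence, index, score}` tuples can be sketched as a sort followed by a take. The module name `GreedySketch` is hypothetical, strings stand in for sentence structs, and restoring document order inside the same function is an assumption; the library may do that in a separate step.

```elixir
defmodule GreedySketch do
  # Hypothetical helper, not part of Nasty.
  # Keeps the n highest-scoring tuples, then sorts them by original index.
  def select(scored, n) do
    scored
    |> Enum.sort_by(fn {_sentence, _index, score} -> score end, :desc)
    |> Enum.take(n)
    |> Enum.sort_by(fn {_sentence, index, _score} -> index end)
  end
end

scored = [{"a", 0, 0.2}, {"b", 1, 0.9}, {"c", 2, 0.5}, {"d", 3, 0.7}]
GreedySketch.select(scored, 2)
```

For this input the two best scores belong to "b" and "d", which come back in document order.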
@spec select_mmr(module(), [{Nasty.AST.Sentence.t(), integer(), float()}], integer(), keyword()) :: [{Nasty.AST.Sentence.t(), integer(), float()}]
MMR selection: maximize relevance while minimizing redundancy.
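Maximal Marginal Relevance picks sentences one at a time, each time choosing the candidate that maximizes `lambda * relevance - (1 - lambda) * max_similarity_to_selected`. The sketch below implements that loop with Jaccard similarity over token lists. The module name `MmrSketch`, the token-list stand-in for sentences, and the final sort into document order are assumptions; `lambda` plays the role of the `:mmr_lambda` option.

```elixir
defmodule MmrSketch do
  # Hypothetical helper, not part of Nasty.
  # Candidates are {tokens, index, score} tuples; tokens is a list of strings.
  def select(candidates, n, lambda \\ 0.5), do: pick(candidates, [], n, lambda)

  # Stop when candidates run out or enough sentences have been chosen.
  defp pick([], selected, _n, _lambda), do: Enum.sort_by(selected, &elem(&1, 1))

  defp pick(_rest, selected, n, _lambda) when length(selected) >= n,
    do: Enum.sort_by(selected, &elem(&1, 1))

  defp pick(rest, selected, n, lambda) do
    # Relevance is rewarded, similarity to already-chosen sentences is penalized.
    best =
      Enum.max_by(rest, fn {tokens, _index, score} ->
        lambda * score - (1 - lambda) * max_similarity(tokens, selected)
      end)

    pick(List.delete(rest, best), [best | selected], n, lambda)
  end

  # Highest Jaccard similarity between tokens and any selected sentence.
  defp max_similarity(_tokens, []), do: 0.0

  defp max_similarity(tokens, selected) do
    selected
    |> Enum.map(fn {other, _index, _score} -> jaccard(tokens, other) end)
    |> Enum.max()
  end

  defp jaccard(a, b) do
    a = MapSet.new(a)
    b = MapSet.new(b)
    union = MapSet.size(MapSet.union(a, b))
    if union == 0, do: 0.0, else: MapSet.size(MapSet.intersection(a, b)) / union
  end
end

candidates = [
  {["cat", "sat", "mat"], 0, 0.9},
  {["cat", "sat", "mat", "too"], 1, 0.85},
  {["dog", "ran", "far"], 2, 0.6}
]

MmrSketch.select(candidates, 2)
```

With `lambda` at 0.5, the second pick is the sentence at index 2 rather than the higher-scoring but near-duplicate sentence at index 1, which is the redundancy penalty at work. A plain top-N selection would have returned indices 0 and 1.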
@spec summarize(module(), Nasty.AST.Document.t(), keyword()) :: [Nasty.AST.Sentence.t()]
Summarizes a document using extractive methods.
Options
- :ratio - Compression ratio (0.0 to 1.0), default 0.3
- :max_sentences - Maximum number of sentences in summary
- :min_sentence_length - Minimum sentence length (in tokens)
- :method - Selection method: :greedy or :mmr (default: :greedy)
- :mmr_lambda - MMR diversity parameter, 0-1 (default: 0.5)
- :score_weights - Custom weights for scoring components (map)
Returns a list of selected sentences in document order.