Text.Summarize (Text v0.5.0)

Extractive text summarization.

Picks the most representative sentences from a document and returns them in their original order. Two algorithms are implemented:

:textrank (default) builds a sentence similarity graph and runs PageRank over it, so sentences that are similar to many other sentences score higher. Introduced by Mihalcea & Tarau (2004).
:lexrank applies the same graph-based ranking but with a hard similarity threshold: edges whose similarity falls below the threshold are dropped before scoring. Introduced by Erkan & Radev (2004).
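
As a rough illustration of the ranking step the two algorithms share, the sketch below runs a fixed number of PageRank iterations over a weighted similarity matrix. The RankSketch module, its rank/2 function, and the list-of-lists matrix are assumptions made for illustration and are not the library's internals. The option names mirror those documented for summarize/2, except that :threshold defaults to 0.0 here, so no edges are dropped unless a cutoff is passed.

```elixir
defmodule RankSketch do
  # Hypothetical helper, not part of Text.Summarize.
  # `weights` is a square list of lists: the value at row i, column j is the
  # similarity between sentence i and sentence j. The diagonal is ignored.
  def rank(weights, opts \\ [])

  def rank([], _opts), do: []

  def rank(weights, opts) do
    damping = Keyword.get(opts, :damping, 0.85)
    iterations = Keyword.get(opts, :iterations, 30)
    threshold = Keyword.get(opts, :threshold, 0.0)
    n = length(weights)

    # Drop self-loops and, LexRank-style, any edge below the threshold.
    pruned =
      for {row, i} <- Enum.with_index(weights) do
        for {w, j} <- Enum.with_index(row) do
          if i != j and w >= threshold, do: w * 1.0, else: 0.0
        end
      end

    # Total outgoing weight per sentence, used to normalise each edge.
    out_sums = Enum.map(pruned, &Enum.sum/1)
    initial = List.duplicate(1.0 / n, n)

    Enum.reduce(1..iterations, initial, fn _, scores ->
      for i <- 0..(n - 1) do
        # Score flowing into sentence i from every sentence j that links to it.
        incoming =
          [pruned, out_sums, scores]
          |> Enum.zip()
          |> Enum.reduce(0.0, fn {row, out, score}, acc ->
            if out > 0.0, do: acc + Enum.at(row, i) / out * score, else: acc
          end)

        (1.0 - damping) / n + damping * incoming
      end
    end)
  end
end
```

Calling RankSketch.rank(weights, threshold: 0.1) would approximate the :lexrank variant under these assumptions, while the default call corresponds to the unthresholded :textrank behaviour.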

Both algorithms use Jaccard similarity over content words (stopwords removed) as the edge weight between sentences: the number of content words two sentences share, divided by the number of distinct content words across both. Graph construction and ranking are written in pure Elixir, with no embeddings and no external models, so summarization works on any text the segmenter can handle.
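
A minimal sketch of Jaccard similarity over content words may make the edge weight concrete. The JaccardSketch module, the regex-based tokenizer, and the tiny stopword list are all assumptions for illustration; the library relies on its own segmenter and per-language stopword lists, which are not shown here.

```elixir
defmodule JaccardSketch do
  # Illustrative English stopwords only; not the library's list.
  @stopwords ~w(a an and are in is of the to)

  # Lowercase, split on anything that is not a letter or digit,
  # and drop stopwords, leaving a set of content words.
  def content_words(sentence) do
    sentence
    |> String.downcase()
    |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
    |> Enum.reject(&(&1 in @stopwords))
    |> MapSet.new()
  end

  # Jaccard similarity: |A ∩ B| / |A ∪ B|, or 0.0 when both sets are empty.
  def similarity(a, b) do
    set_a = content_words(a)
    set_b = content_words(b)
    union = MapSet.size(MapSet.union(set_a, set_b))

    if union == 0 do
      0.0
    else
      MapSet.size(MapSet.intersection(set_a, set_b)) / union
    end
  end
end
```

For example, "Cats are lovely pets." and "Dogs are lovely pets." share two of four distinct content words under this tokenizer, so their similarity is 0.5.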

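The module is suited to documents ranging from a few sentences to a few hundred. Because the similarity graph compares every pair of sentences, the work grows quadratically with sentence count, so for very long documents consider chunking first. Abstractive summarization (rewriting rather than selecting) is not part of this module; it is a Bumblebee-backed feature on the deferred list.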
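
One possible way to chunk a long document before summarizing is sketched below. It assumes the caller already has the document as a list of sentences and passes in a summarizer function; ChunkSketch and summarize_in_chunks/3 are hypothetical names, and the chunk size is a tuning choice rather than a documented limit.

```elixir
defmodule ChunkSketch do
  # Hypothetical helper, not part of Text.Summarize.
  # `summarize_fun` is any function from String to String, for example
  # &Text.Summarize.summarize(&1, sentences: 3).
  def summarize_in_chunks(sentences, chunk_size, summarize_fun) do
    sentences
    |> Enum.chunk_every(chunk_size)
    # Rebuild each chunk as a small document.
    |> Enum.map(&Enum.join(&1, " "))
    # Summarize each chunk independently, keeping chunk order.
    |> Enum.map(summarize_fun)
    |> Enum.join(" ")
  end
end
```

Each chunk is ranked on its own, so a sentence only competes with its neighbours in the same chunk. If a single global ranking matters, the combined output could be summarized once more.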

Summary

Functions

scores(text, options \\ [])

Returns the per-sentence importance scores from the chosen algorithm.

summarize(text, options \\ [])

Returns the most representative sentences from the input text.

summarize_sentences(text, options \\ [])

Returns the selected sentences as a list (rather than joined).

Functions

scores(text, options \\ [])

@spec scores(
  String.t(),
  keyword()
) :: [float()]

Returns the per-sentence importance scores from the chosen algorithm.

Useful for callers that want to render a heatmap of sentence importance, or implement their own selection policy on top of the raw scores.

Returns

A list of floats, one per sentence in the input, in document order.
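
A custom selection policy on top of the raw scores might look like the sketch below. It assumes the caller holds a list of sentences aligned one-to-one with the scores (the same segmentation, in document order); SelectSketch and top_k/3 are hypothetical names and not part of this module.

```elixir
defmodule SelectSketch do
  # Hypothetical policy: keep the `k` highest-scoring sentences,
  # then restore their original document order.
  def top_k(sentences, scores, k) do
    sentences
    |> Enum.zip(scores)
    |> Enum.with_index()
    |> Enum.sort_by(fn {{_sentence, score}, _index} -> score end, :desc)
    |> Enum.take(k)
    |> Enum.sort_by(fn {_pair, index} -> index end)
    |> Enum.map(fn {{sentence, _score}, _index} -> sentence end)
  end
end
```

Other policies, such as keeping every sentence above a score cutoff, follow the same shape: pair sentences with scores, filter, then sort by position.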

summarize(text, options \\ [])

@spec summarize(
  String.t(),
  keyword()
) :: String.t()

Returns the most representative sentences from the input text.

Arguments

text is the document as a string.

Options

:sentences is the number of sentences to return. Default 3. If the document has fewer sentences than this, every sentence is returned.
:algorithm is :textrank (default) or :lexrank.
:language is the language atom used for sentence segmentation and stopword removal. Default :en.
:damping is the PageRank damping factor. Default 0.85.
:iterations is the number of PageRank iterations. Default 30.
:threshold is the LexRank similarity cutoff. Edges with similarity below this value are dropped. Only used by :lexrank. Default 0.1.

Returns

A string of selected sentences joined with single spaces, in original document order.

Examples

iex> text = "Cats are lovely pets. Dogs are loyal animals. Goldfish swim quietly. Birds sing in the morning."
iex> Text.Summarize.summarize(text, sentences: 2) |> String.contains?(".")
true

summarize_sentences(text, options \\ [])

@spec summarize_sentences(
  String.t(),
  keyword()
) :: [String.t()]

Returns the selected sentences as a list (rather than joined).

Accepts the same options as summarize/2. Useful when the caller wants to format the output (as bullets or a numbered list, for example) rather than receive pre-joined prose.
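
As a small sketch of that kind of formatting, the helpers below turn a list of sentences into a numbered list or a bulleted list. FormatSketch and its functions are hypothetical and operate on any list of strings, such as the one summarize_sentences/2 returns.

```elixir
defmodule FormatSketch do
  # Renders "1. ...", "2. ..." lines, one sentence per line.
  def numbered(sentences) do
    sentences
    |> Enum.with_index(1)
    |> Enum.map_join("\n", fn {sentence, n} -> "#{n}. #{sentence}" end)
  end

  # Renders "- ..." lines, one sentence per line.
  def bullets(sentences) do
    Enum.map_join(sentences, "\n", &("- " <> &1))
  end
end
```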

Examples

iex> text = "First sentence. Second sentence. Third sentence. Fourth sentence."
iex> result = Text.Summarize.summarize_sentences(text, sentences: 2)
iex> length(result)
2