# `Text.Summarize`
[🔗](https://github.com/kipcole9/text/blob/v0.5.0/lib/summarize.ex#L1)

Extractive text summarization.

Picks the most representative sentences from a document and
returns them in their original order. Two algorithms are
implemented:

* `:textrank` (default) — builds a sentence similarity graph and
  runs PageRank over it. Sentences that are similar to many other
  sentences score higher. From Mihalcea & Tarau (2004).

* `:lexrank` — the same idea but with a hard similarity threshold:
  edges below the threshold are dropped before scoring. From
  Erkan & Radev (2004).

Both algorithms use Jaccard similarity over content words (with
stopwords removed) as the inter-sentence weight. The graph
construction and ranking are pure Elixir — no embeddings, no
external models — so summarization works on any text the
segmenter can handle.
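To make the inter-sentence weight concrete, here is a minimal sketch of Jaccard similarity over content words. The module name `JaccardSketch`, the tokenizer, and the tiny stopword list are all assumptions made for illustration; the library's own tokenization and stopword lists may well differ.

```elixir
defmodule JaccardSketch do
  # Tiny illustrative stopword list (an assumption, not the library's list).
  @stopwords ~w(a an and are in is of the to)

  # Lowercase the sentence, split on runs of non-letter characters,
  # drop stopwords, and return the remaining words as a set.
  def content_words(sentence) do
    sentence
    |> String.downcase()
    |> String.split(~r/[^\p{L}]+/u, trim: true)
    |> Enum.reject(&(&1 in @stopwords))
    |> MapSet.new()
  end

  # Jaccard similarity: |A ∩ B| / |A ∪ B|.
  # Defined here as 0.0 when both sets are empty.
  def jaccard(a, b) do
    union = MapSet.size(MapSet.union(a, b))

    if union == 0 do
      0.0
    else
      MapSet.size(MapSet.intersection(a, b)) / union
    end
  end
end
```

Two sentences that share two of five distinct content words would get a weight of `0.4` under this sketch; sentences with no content words in common get `0.0`.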

The module is suited to documents ranging from a few sentences to
a few hundred sentences. For very long documents, consider
chunking the text first. Abstractive summarization (rewriting
rather than selecting) is not implemented; it is a
Bumblebee-backed feature on the deferred list.
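One way to approach chunking for long documents is sketched below. It assumes the caller already has the document as a list of sentences and passes in the summarizer as a function; `ChunkSketch`, `summarize_in_chunks/3`, and the chunk size are hypothetical names and values, not part of this library.

```elixir
defmodule ChunkSketch do
  # Split `sentences` into groups of at most `size`, summarize each group
  # with the caller-supplied `summarize_fun`, and join the partial
  # summaries with single spaces.
  def summarize_in_chunks(sentences, size, summarize_fun) do
    sentences
    |> Enum.chunk_every(size)
    |> Enum.map(fn chunk -> summarize_fun.(Enum.join(chunk, " ")) end)
    |> Enum.join(" ")
  end
end
```

A caller could pass `&Text.Summarize.summarize(&1, sentences: 2)` as `summarize_fun`. The right chunk size depends on the document and is left to the caller.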

# `scores`

```elixir
@spec scores(
  String.t(),
  keyword()
) :: [float()]
```

Returns the per-sentence importance scores from the chosen
algorithm.

Useful for callers that want to render a heatmap of sentence
importance, or implement their own selection policy on top of
the raw scores.

### Returns

* A list of floats, one per sentence in the input, in document order.
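As an illustration of a custom selection policy built on the raw scores, the sketch below keeps the `k` highest-scoring sentences and restores document order. It assumes the caller has a list of sentences aligned one-to-one with the scores (how that list is obtained is outside this sketch); `SelectSketch` and `top_k/3` are hypothetical names.

```elixir
defmodule SelectSketch do
  # Keep the `k` highest-scoring sentences, then restore document order.
  # `sentences` and `scores` must be the same length and in the same order.
  def top_k(sentences, scores, k) do
    sentences
    |> Enum.zip(scores)
    |> Enum.with_index()
    # Highest score first; the sort is stable, so ties keep document order.
    |> Enum.sort_by(fn {{_sentence, score}, _index} -> score end, :desc)
    |> Enum.take(k)
    # Back to original document order.
    |> Enum.sort_by(fn {_pair, index} -> index end)
    |> Enum.map(fn {{sentence, _score}, _index} -> sentence end)
  end
end
```

Other policies, such as keeping every sentence above a score cutoff, follow the same pattern of pairing each score with its position.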

# `summarize`

```elixir
@spec summarize(
  String.t(),
  keyword()
) :: String.t()
```

Returns the most representative sentences from the input text.

### Arguments

* `text` is the document as a string.

### Options

* `:sentences` is the number of sentences to return. Default `3`.
  If the document has fewer sentences than this, every sentence
  is returned.

* `:algorithm` is `:textrank` (default) or `:lexrank`.

* `:language` is the language atom used for sentence segmentation
  and stopword removal. Default `:en`.

* `:damping` is the PageRank damping factor. Default `0.85`.

* `:iterations` is the number of PageRank iterations. Default `30`.

* `:threshold` is the LexRank similarity cutoff. Edges with
  similarity below this value are dropped. Only used by `:lexrank`.
  Default `0.1`.
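To show how `:damping`, `:iterations`, and `:threshold` interact, here is a rough sketch of weighted PageRank over a sentence similarity matrix. It is a generic power-iteration formulation written for illustration, not the library's actual code; `RankSketch` and `rank/2` are hypothetical names, and the sketch defaults `:threshold` to `0.0` (no pruning) so that it behaves like TextRank unless a cutoff is given.

```elixir
defmodule RankSketch do
  # `weights` is a square list-of-lists similarity matrix with a zero
  # diagonal, where weights[j][i] is the similarity of sentences j and i.
  def rank(weights, opts \\ [])

  def rank([], _opts), do: []

  def rank(weights, opts) do
    damping = Keyword.get(opts, :damping, 0.85)
    iterations = Keyword.get(opts, :iterations, 30)
    threshold = Keyword.get(opts, :threshold, 0.0)

    # LexRank-style pruning: edges below the threshold are dropped.
    pruned =
      Enum.map(weights, fn row ->
        Enum.map(row, fn w -> if w < threshold, do: 0.0, else: w end)
      end)

    n = length(pruned)
    out_sums = Enum.map(pruned, &Enum.sum/1)
    initial = List.duplicate(1.0 / n, n)

    Enum.reduce(1..iterations//1, initial, fn _step, scores ->
      for i <- 0..(n - 1) do
        # Each sentence j passes on its score in proportion to the
        # weight of its edge to sentence i.
        incoming =
          [pruned, out_sums, scores]
          |> Enum.zip()
          |> Enum.reduce(0.0, fn {row, out, score}, acc ->
            if out > 0, do: acc + Enum.at(row, i) / out * score, else: acc
          end)

        (1 - damping) / n + damping * incoming
      end
    end)
  end
end
```

A higher `:damping` gives more weight to the graph structure and less to the uniform baseline, and a higher `:threshold` leaves a sparser graph. In this sketch a sentence with no surviving edges receives only the baseline `(1 - damping) / n`.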

### Returns

* A string of selected sentences joined with single spaces, in
  original document order.

### Examples

    iex> text = "Cats are lovely pets. Dogs are loyal animals. Goldfish swim quietly. Birds sing in the morning."
    iex> Text.Summarize.summarize(text, sentences: 2) |> String.contains?(".")
    true

# `summarize_sentences`

```elixir
@spec summarize_sentences(
  String.t(),
  keyword()
) :: [String.t()]
```

Returns the selected sentences as a list (rather than joined).

Same options as `summarize/2`. Useful when the caller wants to
format the output (bullets, numbered lists) rather than receive
pre-joined prose.

### Examples

    iex> text = "First sentence. Second sentence. Third sentence. Fourth sentence."
    iex> result = Text.Summarize.summarize_sentences(text, sentences: 2)
    iex> length(result)
    2
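As a small illustration of caller-side formatting, the sketch below turns a list of sentences, such as the one returned by `summarize_sentences/2`, into a bulleted or numbered list. `FormatSketch` and its functions are hypothetical helpers, not part of this library.

```elixir
defmodule FormatSketch do
  # Render each sentence as a Markdown bullet, one per line.
  def bullets(sentences) do
    Enum.map_join(sentences, "\n", &("* " <> &1))
  end

  # Render each sentence as a numbered item, counting from 1.
  def numbered(sentences) do
    sentences
    |> Enum.with_index(1)
    |> Enum.map_join("\n", fn {sentence, index} -> "#{index}. #{sentence}" end)
  end
end
```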
