Text.Extract.Boundary (Text v0.6.1)

Phase 3 of the URL / email extraction pipeline: trim spurious trailing punctuation from a candidate span.

Real-world prose embeds URLs into sentences. The text "See http://example.com." should yield the URL http://example.com with the sentence-final period dropped, while "https://en.wikipedia.org/wiki/URI_(disambiguation)" should keep the closing parenthesis because it has a matching opener inside the span.

This module produces a (possibly shorter) {start, length} span by:

  1. Stripping trailing characters from a fixed punctuation set (.,;:!?'").

  2. Stripping trailing closing brackets (), ], }, >) that have no matching opener inside the span.

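Both steps repeat until neither removes anything, so a mixed tail such as ")." (where the parenthesis has no opener inside the span) is removed entirely.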
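The trimming loop described above can be sketched as follows. This is a minimal illustration, not the library's actual implementation: the module name `BoundarySketch` is hypothetical, and it assumes that "no matching opener" means the span contains more of that closer than of its corresponding opener.

```elixir
defmodule BoundarySketch do
  # Hypothetical module for illustration only; not part of Text.

  # Step 1: the fixed trailing punctuation set.
  @punctuation [".", ",", ";", ":", "!", "?", "'", "\""]

  # Step 2: each closing bracket mapped to its opener.
  @closers %{")" => "(", "]" => "[", "}" => "{", ">" => "<"}

  def shrink(""), do: ""

  def shrink(candidate) when is_binary(candidate) do
    last = String.last(candidate)

    cond do
      # Trailing punctuation is always dropped, then we try again.
      last in @punctuation ->
        candidate |> drop_last() |> shrink()

      # A trailing closer is dropped only when it is unbalanced.
      Map.has_key?(@closers, last) and unbalanced?(candidate, last) ->
        candidate |> drop_last() |> shrink()

      # Nothing left to trim.
      true ->
        candidate
    end
  end

  # Every trimmed character is single-byte ASCII, so dropping one byte is safe.
  defp drop_last(string), do: binary_part(string, 0, byte_size(string) - 1)

  # Assumption: a closer is unbalanced when the span holds more closers
  # of that kind than matching openers.
  defp unbalanced?(candidate, closer) do
    opener = Map.fetch!(@closers, closer)
    count(candidate, closer) > count(candidate, opener)
  end

  defp count(string, char) do
    string |> String.graphemes() |> Enum.count(&(&1 == char))
  end
end
```

Recursing after every removal is what lets the two steps interleave: once a trailing period is gone, a closer that was hidden behind it becomes the last character and is checked in turn.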

Examples

iex> Text.Extract.Boundary.shrink("http://example.com.")
"http://example.com"

iex> Text.Extract.Boundary.shrink("http://example.com)")
"http://example.com"

iex> Text.Extract.Boundary.shrink("http://en.wikipedia.org/wiki/URI_(disambiguation)")
"http://en.wikipedia.org/wiki/URI_(disambiguation)"

iex> Text.Extract.Boundary.shrink("http://x.com/path......")
"http://x.com/path"

Functions

shrink(candidate)

@spec shrink(String.t()) :: String.t()

Trims trailing punctuation and unbalanced closers from a candidate string.

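Arguments

  • candidate is the candidate string to trim, such as a URL or email address matched earlier in the pipeline.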

Returns

  • The candidate with trailing punctuation and unbalanced closers removed. The result is never longer than the input, and only the end of the string is trimmed.
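Because only the end is trimmed, the result is always a prefix of the input, so a caller tracking a {start, length} span can keep start and shorten length. The sketch below shows one way to do that. `SpanSketch.shrink_span/3` is a hypothetical helper, not part of the library; it assumes spans are byte offsets into the source text and takes the shrinking function as an argument so that it runs without the library loaded.

```elixir
defmodule SpanSketch do
  # Hypothetical helper for illustration only; not part of Text.
  # Assumption: `start` and `length` are byte offsets into `text`.
  # `shrink_fun` stands in for Text.Extract.Boundary.shrink/1.
  def shrink_span(text, {start, length}, shrink_fun) do
    candidate = binary_part(text, start, length)
    trimmed = shrink_fun.(candidate)

    # The trimmed string is a prefix of the candidate, so `start` is
    # unchanged and only the length shrinks.
    {start, byte_size(trimmed)}
  end
end

# Usage with a trivial stand-in that only strips trailing periods:
text = "See http://example.com."
strip_periods = fn s -> String.trim_trailing(s, ".") end
SpanSketch.shrink_span(text, {4, 19}, strip_periods)
# => {4, 18}
```

With the library loaded, `&Text.Extract.Boundary.shrink/1` could be passed in place of the stand-in.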