Phase 3 of the URL / email extraction pipeline: trim spurious trailing punctuation from a candidate span.
Real-world prose embeds URLs into sentences. The text
"See http://example.com." should yield the URL http://example.com
with the sentence-final period dropped, while
"https://en.wikipedia.org/wiki/URI_(disambiguation)" should keep
the closing parenthesis because it has a matching opener inside the
span.
This module produces a (possibly shorter) {start, length} span by:
- Stripping trailing characters from a fixed punctuation set (.,;:!?'").
- Stripping trailing closing brackets (), ], }, >) that have no matching opener inside the span.
The two steps repeat until no further trimming is possible.
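The two steps above can be sketched as a small fixed-point loop. This is a minimal illustration, not the module's actual source: the module name BoundarySketch and its private helpers are hypothetical, and "no matching opener" is approximated by simple counting (a trailing closer is dropped when the span contains more closers of that kind than openers).

```elixir
defmodule BoundarySketch do
  # Hypothetical stand-in for Text.Extract.Boundary; works on strings only.

  # Fixed trailing-punctuation set from the description above.
  @punctuation [".", ",", ";", ":", "!", "?", "'", "\""]
  # {opener, closer} pairs for the bracket step.
  @pairs [{"(", ")"}, {"[", "]"}, {"{", "}"}, {"<", ">"}]

  # Apply both steps, then repeat until the string stops changing.
  def shrink(candidate) when is_binary(candidate) do
    trimmed = candidate |> strip_punctuation() |> strip_unbalanced_closers()
    if trimmed == candidate, do: candidate, else: shrink(trimmed)
  end

  # Step 1: drop trailing characters that belong to the punctuation set.
  defp strip_punctuation(s) do
    if String.ends_with?(s, @punctuation) do
      s |> drop_last_byte() |> strip_punctuation()
    else
      s
    end
  end

  # Step 2: drop a trailing closer when closers outnumber their openers.
  # Assumption: counting stands in for real bracket matching.
  defp strip_unbalanced_closers(s) do
    case Enum.find(@pairs, fn {_open, close} -> String.ends_with?(s, close) end) do
      {open, close} ->
        if count(s, close) > count(s, open) do
          s |> drop_last_byte() |> strip_unbalanced_closers()
        else
          s
        end

      nil ->
        s
    end
  end

  # Every trimmed character is single-byte ASCII, so dropping one byte is safe.
  defp drop_last_byte(s), do: binary_part(s, 0, byte_size(s) - 1)

  # Number of occurrences of a one-character token in the string.
  defp count(s, token), do: length(String.split(s, token)) - 1
end
```

The outer loop matters for inputs such as "http://example.com/a.)": the closer is removed first, which exposes a trailing period that only the next pass can strip.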
Examples
iex> Text.Extract.Boundary.shrink("http://example.com.")
"http://example.com"
iex> Text.Extract.Boundary.shrink("http://example.com)")
"http://example.com"
iex> Text.Extract.Boundary.shrink("http://en.wikipedia.org/wiki/URI_(disambiguation)")
"http://en.wikipedia.org/wiki/URI_(disambiguation)"
iex> Text.Extract.Boundary.shrink("http://x.com/path......")
"http://x.com/path"
Functions
shrink(candidate)
Trims trailing punctuation and unbalanced closers from a candidate string.
Arguments
- candidate is the candidate substring (e.g. as emitted by Text.Extract.Scanner.scan/1).
Returns
- The candidate with trailing junk removed. Never grows; only the end of the string is trimmed.