View Source TextParser
TextParser is an Elixir library for extracting and validating structured tokens from text, such as URLs, hashtags, and mentions. It provides built-in token types and allows you to define custom parsers with specific validation rules.
This library was extracted from justcrosspost.app where processing tags, mentions and URLs for Bluesky is kinda tricky.
Installation
Add text_parser
to your list of dependencies in mix.exs
:
def deps do
[
{:text_parser, "~> 0.1"}
]
end
Usage
Basic Parsing
By default, TextParser extracts URLs, hashtags, and mentions:
text = "Check out https://elixir-lang.org #elixir @joe"
result = TextParser.parse(text)
# Get URLs
urls = TextParser.get(result, TextParser.Tokens.URL)
# => [%TextParser.Tokens.URL{value: "https://elixir-lang.org", position: {10, 32}}]
# Get hashtags
tags = TextParser.get(result, TextParser.Tokens.Tag)
# => [%TextParser.Tokens.Tag{value: "#elixir", position: {33, 40}}]
# Get mentions
mentions = TextParser.get(result, TextParser.Tokens.Mention)
# => [%TextParser.Tokens.Mention{value: "@joe", position: {41, 45}}]
Selective Token Extraction
You can specify which token types to extract:
alias TextParser.Tokens.{URL, Tag}
# Extract only URLs and hashtags
result = TextParser.parse("Check out https://elixir-lang.org #elixir @joe", extract: [URL, Tag])
TextParser.get(result, URL) # => [%URL{...}]
TextParser.get(result, Tag) # => [%Tag{...}]
Custom Tokens
You can create custom token types to extract different patterns from text. Here's an example of a token that extracts ISO 8601 dates:
defmodule MyParser.Tokens.Date do
use TextParser.Token,
# Match YYYY-MM-DD format, requiring space or start of string before the date
pattern: ~r/(?:^|\s)(\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01]))/,
trim_chars: [",", ".", "!", "?"]
@impl true
def is_valid?(date_text) when is_binary(date_text) do
case Date.from_iso8601(date_text) do
{:ok, _date} -> true
_ -> false
end
end
def is_valid?(_), do: false
end
# Usage
text = "Meeting on 2024-01-15, conference on 2024-02-30, party on 2024-12-31!"
result = TextParser.parse(text, extract: [MyParser.Tokens.Date])
dates = TextParser.get(result, MyParser.Tokens.Date)
# => [
# %MyParser.Tokens.Date{value: "2024-01-15", position: {11, 21}},
# %MyParser.Tokens.Date{value: "2024-12-31", position: {47, 57}}
# ]
# Note: 2024-02-30 is filtered out as it's not a valid date
Custom tokens require:
- A regex
pattern
that captures the token in the first capture group - Optional
trim_chars
to remove trailing punctuation - An
is_valid?/1
function that validates the extracted value
You can mix custom tokens with built-in ones:
alias TextParser.Tokens.{URL, Tag}
alias MyParser.Tokens.Date
result = TextParser.parse(
"Meeting on 2024-01-15 at https://example.com #elixir",
extract: [URL, Tag, Date]
)
TextParser.get(result, Date) # => [%Date{value: "2024-01-15", position: {11, 21}}]
TextParser.get(result, URL) # => [%URL{value: "https://example.com", position: {25, 44}}]
TextParser.get(result, Tag) # => [%Tag{value: "#elixir", position: {45, 52}}]
Custom Parsers
You can create custom parsers with specific validation rules:
defmodule BlueskyParser do
use TextParser
alias TextParser.Tokens.Tag
# Enforce Bluesky's 64-character limit for hashtags
def validate(%Tag{value: value} = tag) do
if String.length(value) >= 66,
do: {:error, "tag too long"},
else: {:ok, tag}
end
# Allow other token types to pass through
def validate(token), do: {:ok, token}
end
# Usage
text = "Check out #elixir and #this_is_a_very_long_hashtag_that_exceeds_bluesky_limit"
result = BlueskyParser.parse(text)
# Only valid tags are included
TextParser.get(result, Tag)
# => [%Tag{value: "#elixir", position: {10, 17}}]
Token Information
Each token includes its value and position in the original text:
text = "Check https://elixir-lang.org #elixir"
result = TextParser.parse(text)
[url] = TextParser.get(result, TextParser.Tokens.URL)
url.value # => "https://elixir-lang.org"
url.position # => {6, 29} (start and end byte positions)
:warning: The position
tuple uses a range format where:
- The start position is inclusive (the first byte of the token)
- The end position is exclusive (one byte past the last byte of the token)
For example, with position: {6, 29}
, the token starts at byte 6 and ends at byte 28. This format makes it easy to:
- Calculate token length:
end - start
(29 - 6 = 23 bytes) - Extract the token from text:
binary_part(text, start, end - start)
Documentation
The full documentation can be found at https://hexdocs.pm/text_parser.