View Source TextParser

CI Hex pm hex.pm downloads

TextParser is an Elixir library for extracting and validating structured tokens from text, such as URLs, hashtags, and mentions. It provides built-in token types and allows you to define custom parsers with specific validation rules.

This library was extracted from justcrosspost.app where processing tags, mentions and URLs for Bluesky is kinda tricky.

Installation

Add text_parser to your list of dependencies in mix.exs:

def deps do
  [
    {:text_parser, "~> 0.1"}
  ]
end

Usage

Basic Parsing

By default, TextParser extracts URLs, hashtags, and mentions:

text = "Check out https://elixir-lang.org #elixir @joe"

result = TextParser.parse(text)

# Get URLs
urls = TextParser.get(result, TextParser.Tokens.URL)
# => [%TextParser.Tokens.URL{value: "https://elixir-lang.org", position: {10, 32}}]

# Get hashtags
tags = TextParser.get(result, TextParser.Tokens.Tag)
# => [%TextParser.Tokens.Tag{value: "#elixir", position: {33, 40}}]

# Get mentions
mentions = TextParser.get(result, TextParser.Tokens.Mention)
# => [%TextParser.Tokens.Mention{value: "@joe", position: {41, 45}}]

Selective Token Extraction

You can specify which token types to extract:

alias TextParser.Tokens.{URL, Tag}

# Extract only URLs and hashtags
result = TextParser.parse("Check out https://elixir-lang.org #elixir @joe", extract: [URL, Tag])

TextParser.get(result, URL) # => [%URL{...}]
TextParser.get(result, Tag) # => [%Tag{...}]

Custom Tokens

You can create custom token types to extract different patterns from text. Here's an example of a token that extracts ISO 8601 dates:

defmodule MyParser.Tokens.Date do
  use TextParser.Token,
    # Match YYYY-MM-DD format, requiring space or start of string before the date
    pattern: ~r/(?:^|\s)(\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01]))/,
    trim_chars: [",", ".", "!", "?"]

  @impl true
  def is_valid?(date_text) when is_binary(date_text) do
    case Date.from_iso8601(date_text) do
      {:ok, _date} -> true
      _ -> false
    end
  end

  def is_valid?(_), do: false
end

# Usage
text = "Meeting on 2024-01-15, conference on 2024-02-30, party on 2024-12-31!"
result = TextParser.parse(text, extract: [MyParser.Tokens.Date])

dates = TextParser.get(result, MyParser.Tokens.Date)
# => [
#   %MyParser.Tokens.Date{value: "2024-01-15", position: {11, 21}},
#   %MyParser.Tokens.Date{value: "2024-12-31", position: {47, 57}}
# ]
# Note: 2024-02-30 is filtered out as it's not a valid date

Custom tokens require:

  1. A regex pattern that captures the token in the first capture group
  2. Optional trim_chars to remove trailing punctuation
  3. An is_valid?/1 function that validates the extracted value

You can mix custom tokens with built-in ones:

alias TextParser.Tokens.{URL, Tag}
alias MyParser.Tokens.Date

result = TextParser.parse(
  "Meeting on 2024-01-15 at https://example.com #elixir",
  extract: [URL, Tag, Date]
)

TextParser.get(result, Date)  # => [%Date{value: "2024-01-15", position: {11, 21}}]
TextParser.get(result, URL)   # => [%URL{value: "https://example.com", position: {25, 44}}]
TextParser.get(result, Tag)   # => [%Tag{value: "#elixir", position: {45, 52}}]

Custom Parsers

You can create custom parsers with specific validation rules:

defmodule BlueskyParser do
  use TextParser

  alias TextParser.Tokens.Tag

  # Enforce Bluesky's 64-character limit for hashtags
  def validate(%Tag{value: value} = tag) do
    if String.length(value) >= 66,
      do: {:error, "tag too long"},
      else: {:ok, tag}
  end

  # Allow other token types to pass through
  def validate(token), do: {:ok, token}
end

# Usage
text = "Check out #elixir and #this_is_a_very_long_hashtag_that_exceeds_bluesky_limit"

result = BlueskyParser.parse(text)

# Only valid tags are included
TextParser.get(result, Tag)
# => [%Tag{value: "#elixir", position: {10, 17}}]

Token Information

Each token includes its value and position in the original text:

text = "Check https://elixir-lang.org #elixir"
result = TextParser.parse(text)

[url] = TextParser.get(result, TextParser.Tokens.URL)

url.value     # => "https://elixir-lang.org"
url.position  # => {6, 29} (start and end byte positions)

:warning: The position tuple uses a range format where:

  • The start position is inclusive (the first byte of the token)
  • The end position is exclusive (one byte past the last byte of the token)

For example, with position: {6, 29}, the token starts at byte 6 and ends at byte 28. This format makes it easy to:

  • Calculate token length: end - start (29 - 6 = 23 bytes)
  • Extract the token from text: binary_part(text, start, end - start)

Documentation

The full documentation can be found at https://hexdocs.pm/text_parser.