LeXtract.Tokenizer (lextract v0.1.2)

Tokenization wrapper using the Hugging Face Tokenizers library.

Provides token-level text analysis with character offset tracking, which is essential for text alignment in the extraction pipeline. Uses a GenServer to cache loaded tokenizers and avoid reloading overhead.
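
As a minimal sketch of that alignment workflow (assuming character-based offsets, per get_offset/2 below, and composing the functions documented on this page):

text = "The quick brown fox"
{:ok, encoding} = LeXtract.Tokenizer.tokenize(text)
# Locate the token span for a phrase, then map it back to source characters.
{:ok, first, stop} = LeXtract.Tokenizer.find_sequence(encoding, ["quick", "brown"])
{char_start, _} = LeXtract.Tokenizer.get_offset(encoding, first)
{_, char_end} = LeXtract.Tokenizer.get_offset(encoding, stop - 1)
String.slice(text, char_start, char_end - char_start)
#=> "quick brown"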

Default Tokenizer

By default, uses the bert-base-uncased tokenizer for its balance of performance and Unicode handling. The tokenizer is loaded once and cached for the lifetime of the application.
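
To use a different tokenizer, one can (as a sketch, relying on the underlying Tokenizers package) load it with Tokenizers.Tokenizer.from_pretrained/1 and pass it per call via the :tokenizer option of tokenize/2:

# Illustrative: swap in a cased multilingual model for a single call.
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-multilingual-cased")
{:ok, encoding} = LeXtract.Tokenizer.tokenize("Grüße aus Berlin", tokenizer: tokenizer)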

Examples

iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Hello, world!")
iex> is_map(encoding)
true

iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Test 😁")
iex> tokens = LeXtract.Tokenizer.get_tokens(encoding)
iex> is_list(tokens)
true

Summary

Functions

child_spec(init_arg)
Returns a specification to start this module under a supervisor.

clear_cache()
Clears the tokenizer cache.

default_tokenizer()
Returns the default tokenizer instance.

find_sequence(map, needle, opts \\ [])
Finds a token sequence in the encoding.

get_offset(map, index)
Gets the character offset tuple {start, end} for the token at the given index.

get_offsets(map)
Gets all character offsets from the encoding.

get_token(arg1, index)
Gets the token string at a specific index from the encoding.

get_tokens(map)
Gets all tokens from the encoding.

start_link(opts \\ [])
Starts the tokenizer cache GenServer.

tokenize(text, opts \\ [])
Tokenizes text and returns an encoding with character offsets.

Types

encoding()

@type encoding() :: %{
  tokens: [String.t()],
  ids: [non_neg_integer()],
  offsets: [{non_neg_integer(), non_neg_integer()}],
  encoding: Tokenizers.Encoding.t(),
  text: String.t()
}

tokenizer_ref()

@type tokenizer_ref() :: Tokenizers.Tokenizer.t()

Functions

child_spec(init_arg)

Returns a specification to start this module under a supervisor.

See Supervisor.
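
For instance, the module can be listed directly in an application's supervision tree (a typical pattern, sketched here):

children = [
  LeXtract.Tokenizer
]

Supervisor.start_link(children, strategy: :one_for_one)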

clear_cache()

@spec clear_cache() :: :ok

Clears the tokenizer cache.

Useful for testing or when you need to reload tokenizers. Not typically needed in production code.
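
For example, an ExUnit setup that starts each test with a cold cache (illustrative only):

setup do
  LeXtract.Tokenizer.clear_cache()
  :ok
end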

default_tokenizer()

@spec default_tokenizer() :: {:ok, tokenizer_ref()} | {:error, Exception.t()}

Returns the default tokenizer instance.

The tokenizer is cached and reused across calls. This function will block if the tokenizer is currently being loaded.

Examples

iex> {:ok, tokenizer} = LeXtract.Tokenizer.default_tokenizer()
iex> is_struct(tokenizer)
true

find_sequence(map, needle, opts \\ [])

@spec find_sequence(encoding(), [String.t()], keyword()) ::
  {:ok, non_neg_integer(), non_neg_integer()} | :not_found

Finds a token sequence in the encoding.

Performs a case-insensitive search by default. On success, returns {:ok, start, stop} with the inclusive start index and exclusive end index of the first matching token span; returns :not_found if no match exists.

Options

  • :case_sensitive - Perform a case-sensitive search (default: false)

Examples

iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("The quick brown fox")
iex> LeXtract.Tokenizer.find_sequence(encoding, ["quick", "brown"])
{:ok, 1, 3}

iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Hello world")
iex> LeXtract.Tokenizer.find_sequence(encoding, ["missing"])
:not_found
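
Since the default tokenizer lowercases its tokens (see get_tokens/1 below), a case-sensitive search for a mixed-case needle will typically miss:

iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("The quick brown fox")
iex> LeXtract.Tokenizer.find_sequence(encoding, ["Quick"], case_sensitive: true)
:not_found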

get_offset(map, index)

@spec get_offset(encoding(), non_neg_integer()) ::
  {non_neg_integer(), non_neg_integer()} | nil

Gets the character offset tuple {start, end} for the token at the given index.

Returns nil if the index is out of bounds.

Examples

iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Hello world")
iex> LeXtract.Tokenizer.get_offset(encoding, 0)
{0, 5}

iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Test")
iex> LeXtract.Tokenizer.get_offset(encoding, 999)
nil
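
Because the offsets index into the original string, a token's source text can be recovered (assuming character-based offsets) with String.slice/3:

iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Hello world")
iex> {start, stop} = LeXtract.Tokenizer.get_offset(encoding, 0)
iex> String.slice("Hello world", start, stop - start)
"Hello"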

get_offsets(map)

@spec get_offsets(encoding()) :: [{non_neg_integer(), non_neg_integer()}]

Gets all character offsets from the encoding.

Examples

iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Hi!")
iex> offsets = LeXtract.Tokenizer.get_offsets(encoding)
iex> is_list(offsets)
true
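
Tokens and offsets are parallel lists, so they can be paired, for example with Enum.zip/2 (offsets shown assuming character positions in "Hello world"):

iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Hello world")
iex> tokens = LeXtract.Tokenizer.get_tokens(encoding)
iex> offsets = LeXtract.Tokenizer.get_offsets(encoding)
iex> Enum.zip(tokens, offsets)
[{"hello", {0, 5}}, {"world", {6, 11}}]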

get_token(arg1, index)

@spec get_token(encoding(), non_neg_integer()) :: String.t() | nil

Gets the token string at a specific index from the encoding.

Returns nil if the index is out of bounds.
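
Examples

A sketch consistent with the get_tokens/1 example below:

iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Hello world")
iex> LeXtract.Tokenizer.get_token(encoding, 1)
"world"
iex> LeXtract.Tokenizer.get_token(encoding, 99)
nil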

get_tokens(map)

@spec get_tokens(encoding()) :: [String.t()]

Gets all tokens from the encoding.

Examples

iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Hello world")
iex> LeXtract.Tokenizer.get_tokens(encoding)
["hello", "world"]

start_link(opts \\ [])

@spec start_link(keyword()) :: GenServer.on_start()

Starts the tokenizer cache GenServer.

This is typically called automatically by the application supervisor.
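
Outside a supervision tree (for instance in a one-off script), it can also be started manually:

{:ok, _pid} = LeXtract.Tokenizer.start_link()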

tokenize(text, opts \\ [])

@spec tokenize(String.t(), keyword()) ::
  {:ok, encoding()} | {:error, Exception.t()}

Tokenizes text and returns an encoding with character offsets.

Options

  • :tokenizer - Custom tokenizer to use instead of the default
  • :add_special_tokens - Whether to add special tokens (default: false)

Examples

iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Hello world")
iex> LeXtract.Tokenizer.get_tokens(encoding)
["hello", "world"]

iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Test émojis 🎉")
iex> tokens = LeXtract.Tokenizer.get_tokens(encoding)
iex> length(tokens) > 0
true
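
Two hedged sketches of the options: passing an explicitly loaded tokenizer, and enabling special tokens (which, for the default bert-base-uncased vocabulary, should surround the sequence with [CLS]/[SEP]):

iex> {:ok, tok} = LeXtract.Tokenizer.default_tokenizer()
iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Hello world", tokenizer: tok)
iex> LeXtract.Tokenizer.get_tokens(encoding)
["hello", "world"]

iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Hello", add_special_tokens: true)
iex> LeXtract.Tokenizer.get_tokens(encoding) |> List.first()
"[CLS]"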