LeXtract.Tokenizer (lextract v0.1.2)
Tokenization wrapper using the Hugging Face Tokenizers library.
Provides token-level text analysis with character offset tracking, which is essential for text alignment in the extraction pipeline. Uses a GenServer to cache loaded tokenizers and avoid reloading overhead.
Default Tokenizer
By default, uses the bert-base-uncased tokenizer for its balance of performance
and Unicode handling. The tokenizer is loaded once and cached for the lifetime
of the application.
Examples
iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Hello, world!")
iex> is_map(encoding)
true
iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Test 😁")
iex> tokens = LeXtract.Tokenizer.get_tokens(encoding)
iex> is_list(tokens)
true
Summary
Functions
Returns a specification to start this module under a supervisor.
Clears the tokenizer cache.
Returns the default tokenizer instance.
Finds a token sequence in the encoding.
Gets the character offset tuple {start, end} for token at index.
Gets all offsets from encoding.
Gets the token string at a specific index from encoding.
Gets all tokens from encoding.
Starts the tokenizer cache GenServer.
Tokenizes text and returns encoding with character offsets.
Types
@type encoding() :: %{
  tokens: [String.t()],
  ids: [non_neg_integer()],
  offsets: [{non_neg_integer(), non_neg_integer()}],
  encoding: Tokenizers.Encoding.t(),
  text: String.t()
}
@type tokenizer_ref() :: Tokenizers.Tokenizer.t()
Functions
Returns a specification to start this module under a supervisor.
See Supervisor.
@spec clear_cache() :: :ok
Clears the tokenizer cache.
Useful for testing or when you need to reload tokenizers. Not typically needed in production code.
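For example, a test suite might reset the cache between runs so each test exercises a cold cache. A minimal ExUnit sketch (module and test names are illustrative):

# Sketch: clear the cached tokenizer before each test so tests that
# exercise loading behaviour start from a cold cache.
defmodule LeXtract.TokenizerCacheTest do
  use ExUnit.Case, async: false

  setup do
    :ok = LeXtract.Tokenizer.clear_cache()
    :ok
  end

  test "reloads the default tokenizer after the cache is cleared" do
    assert {:ok, _tokenizer} = LeXtract.Tokenizer.default_tokenizer()
  end
end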
@spec default_tokenizer() :: {:ok, tokenizer_ref()} | {:error, Exception.t()}
Returns the default tokenizer instance.
The tokenizer is cached and reused across calls. This function will block if the tokenizer is currently being loaded.
Examples
iex> {:ok, tokenizer} = LeXtract.Tokenizer.default_tokenizer()
iex> is_struct(tokenizer)
true
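A sketch of reusing the cached instance explicitly by passing it to tokenize/2 via the :tokenizer option documented below:

# Sketch: fetch the cached default tokenizer once and reuse it for a
# batch of texts instead of resolving it on every call.
{:ok, tokenizer} = LeXtract.Tokenizer.default_tokenizer()

texts = ["first document", "second document"]

encodings =
  Enum.map(texts, fn text ->
    {:ok, encoding} = LeXtract.Tokenizer.tokenize(text, tokenizer: tokenizer)
    encoding
  end)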
@spec find_sequence(encoding(), [String.t()], keyword()) :: {:ok, non_neg_integer(), non_neg_integer()} | :not_found
Finds a token sequence in the encoding.
Performs a case-insensitive search by default. Returns the start and end
indices of the first match (end index exclusive), or :not_found if no match exists.
Options
:case_sensitive - Perform case-sensitive search (default: false)
Examples
iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("The quick brown fox")
iex> LeXtract.Tokenizer.find_sequence(encoding, ["quick", "brown"])
{:ok, 1, 3}
iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Hello world")
iex> LeXtract.Tokenizer.find_sequence(encoding, ["missing"])
:not_found
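A sketch of the alignment use case: locate a token sequence, then map it back to a character span with get_offset/2 (the end index is exclusive, so the last matched token sits just before it):

# Sketch: find a token sequence and recover its character span in the
# original text. Offsets are character-based, as described above.
{:ok, encoding} = LeXtract.Tokenizer.tokenize("The quick brown fox")

with {:ok, start, stop} <- LeXtract.Tokenizer.find_sequence(encoding, ["quick", "brown"]) do
  {char_start, _} = LeXtract.Tokenizer.get_offset(encoding, start)
  {_, char_end} = LeXtract.Tokenizer.get_offset(encoding, stop - 1)
  String.slice("The quick brown fox", char_start, char_end - char_start)
  # => "quick brown"
end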
@spec get_offset(encoding(), non_neg_integer()) :: {non_neg_integer(), non_neg_integer()} | nil
Gets the character offset tuple {start, end} for token at index.
Returns nil if the index is out of bounds.
Examples
iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Hello world")
iex> LeXtract.Tokenizer.get_offset(encoding, 0)
{0, 5}
iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Test")
iex> LeXtract.Tokenizer.get_offset(encoding, 999)
nil
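A sketch of using an offset to recover a token's surface form from the original text (the encoding keeps the input under :text, per the type above), handling the out-of-bounds nil case:

# Sketch: slice the original text with a token's character offset;
# nil signals an out-of-bounds index.
{:ok, encoding} = LeXtract.Tokenizer.tokenize("Hello world")

case LeXtract.Tokenizer.get_offset(encoding, 1) do
  {start, stop} -> String.slice(encoding.text, start, stop - start)
  nil -> nil
end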
@spec get_offsets(encoding()) :: [{non_neg_integer(), non_neg_integer()}]
Gets all offsets from encoding.
Examples
iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Hi!")
iex> offsets = LeXtract.Tokenizer.get_offsets(encoding)
iex> is_list(offsets)
true
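A sketch of pairing tokens with their character spans, a common building block for alignment (the second offset shown is inferred, not taken from the docs):

# Sketch: pair each token with its character span.
{:ok, encoding} = LeXtract.Tokenizer.tokenize("Hello world")

tokens = LeXtract.Tokenizer.get_tokens(encoding)
offsets = LeXtract.Tokenizer.get_offsets(encoding)

Enum.zip(tokens, offsets)
# => e.g. [{"hello", {0, 5}}, {"world", {6, 11}}]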
@spec get_token(encoding(), non_neg_integer()) :: String.t() | nil
Gets the token string at a specific index from encoding.
Returns nil if the index is out of bounds.
@spec get_tokens(encoding()) :: [String.t()]
Gets all tokens from encoding.
Examples
iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Hello world")
iex> LeXtract.Tokenizer.get_tokens(encoding)
["hello", "world"]
@spec start_link(keyword()) :: GenServer.on_start()
Starts the tokenizer cache GenServer.
This is typically called automatically by the application supervisor.
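A sketch of starting the cache from a host application's supervision tree (the application module name is illustrative):

# Sketch: the cache is usually started by the host application's
# supervisor rather than by calling start_link/1 directly.
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      LeXtract.Tokenizer
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end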
@spec tokenize(String.t(), keyword()) :: {:ok, encoding()} | {:error, Exception.t()}
Tokenizes text and returns encoding with character offsets.
Options
:tokenizer - Custom tokenizer to use instead of the default
:add_special_tokens - Whether to add special tokens (default: false)
Examples
iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Hello world")
iex> LeXtract.Tokenizer.get_tokens(encoding)
["hello", "world"]
iex> {:ok, encoding} = LeXtract.Tokenizer.tokenize("Test émojis 🎉")
iex> tokens = LeXtract.Tokenizer.get_tokens(encoding)
iex> length(tokens) > 0
true
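A sketch combining both options; with the default BERT-style tokenizer, enabling :add_special_tokens would be expected to add markers such as [CLS] and [SEP]:

# Sketch: pass the documented options explicitly.
{:ok, tokenizer} = LeXtract.Tokenizer.default_tokenizer()

{:ok, encoding} =
  LeXtract.Tokenizer.tokenize("Hello world",
    tokenizer: tokenizer,
    add_special_tokens: true
  )

LeXtract.Tokenizer.get_tokens(encoding)
# => e.g. ["[CLS]", "hello", "world", "[SEP]"]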