Tokenizers.Decoder (Tokenizers v0.5.1)

Decoders and decoding functions.

A decoder transforms a sequence of token ids back into a readable piece of text.

Some normalizers and pre-tokenizers use special characters or identifiers that need special logic to be reverted.

Summary

Types

t()

Functions

bpe(opts \\ [])

Creates a BPE decoder.

byte_fallback()

Creates a ByteFallback decoder.

byte_level()

Creates a ByteLevel decoder.

ctc(opts \\ [])

Creates a CTC decoder.

decode(decoder, tokens)

Decodes tokens into a string with the provided decoder.

fuse()

Creates a Fuse decoder.

metaspace(opts \\ [])

Creates a Metaspace decoder.

replace(pattern, content)

Creates a Replace decoder.

sequence(decoders)

Combines a list of decoders into a single sequential decoder.

strip(content, left, right)

Creates a Strip decoder.

word_piece(opts \\ [])

Creates a WordPiece decoder.

Types

t()

@type t() :: %Tokenizers.Decoder{resource: reference()}

Functions

bpe(opts \\ [])

@spec bpe(keyword()) :: t()

Creates a BPE decoder.

Options

:suffix - the suffix that marks the end of a word. It is replaced by whitespace during decoding. Defaults to </w>
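As a minimal sketch of how the suffix is handled, the script below decodes a few hand-written tokens. The token list is made up for illustration, and the `Mix.install/1` call assumes the `:tokenizers` package is fetched from Hex with a `~> 0.5` requirement.

```elixir
# Assumption: the :tokenizers Hex package, at a version compatible with the one documented here.
Mix.install([{:tokenizers, "~> 0.5"}])

alias Tokenizers.Decoder

# Illustrative tokens: "</w>" marks the last piece of each word.
tokens = ["hel", "lo</w>", "wor", "ld</w>"]

decoder = Decoder.bpe(suffix: "</w>")
{:ok, text} = Decoder.decode(decoder, tokens)

# The suffix becomes a space between words and is dropped on the last token.
IO.inspect(text)
# Expected: "hello world"
```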

byte_fallback()

@spec byte_fallback() :: t()

Creates a ByteFallback decoder.
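A short sketch of what this decoder is expected to do: tokens of the form `<0xNN>` are treated as raw bytes and reassembled into UTF-8 text, while other tokens pass through unchanged. The tokens are hand-written for illustration, and the `Mix.install/1` line assumes the `:tokenizers` package from Hex.

```elixir
# Assumption: the :tokenizers Hex package, at a version compatible with the one documented here.
Mix.install([{:tokenizers, "~> 0.5"}])

alias Tokenizers.Decoder

# 0xE2 0x82 0xAC is the UTF-8 encoding of the euro sign; "5" is an ordinary token.
tokens = ["<0xE2>", "<0x82>", "<0xAC>", "5"]

decoder = Decoder.byte_fallback()
{:ok, text} = Decoder.decode(decoder, tokens)

IO.inspect(text)
# Expected: "€5"
```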

byte_level()

@spec byte_level() :: t()

Creates a ByteLevel decoder.
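For illustration, the sketch below reverts the byte-level alphabet used by GPT-2 style tokenizers, where `Ġ` stands for a space byte. The tokens are hand-written, and the `Mix.install/1` line assumes the `:tokenizers` package from Hex.

```elixir
# Assumption: the :tokenizers Hex package, at a version compatible with the one documented here.
Mix.install([{:tokenizers, "~> 0.5"}])

alias Tokenizers.Decoder

# "Ġ" (U+0120) is the byte-level stand-in for the space byte 0x20.
tokens = ["Hello", "Ġworld"]

decoder = Decoder.byte_level()
{:ok, text} = Decoder.decode(decoder, tokens)

IO.inspect(text)
# Expected: "Hello world"
```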

ctc(opts \\ [])

@spec ctc(keyword()) :: t()

Creates a CTC decoder.

Options

:pad_token - the token used for padding. Defaults to <pad>
:word_delimiter_token - the token used as the word delimiter. Defaults to |
:cleanup - whether to clean up tokenization artifacts. Defaults to true
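The options above can be exercised with a hand-written CTC output. This is a sketch under stated assumptions: the frame sequence is invented, the `Mix.install/1` line assumes the `:tokenizers` package from Hex, and the described behaviour (collapse consecutive repeats, drop the pad token, turn the word delimiter into a space) follows the upstream CTC decoder as far as I know it.

```elixir
# Assumption: the :tokenizers Hex package, at a version compatible with the one documented here.
Mix.install([{:tokenizers, "~> 0.5"}])

alias Tokenizers.Decoder

# Invented per-frame output: repeats are collapsed, and "<pad>" separates the two real "l"s.
tokens = ["<pad>", "h", "h", "e", "l", "l", "<pad>", "l", "o", "|", "w", "o", "r", "l", "d"]

decoder = Decoder.ctc(pad_token: "<pad>", word_delimiter_token: "|", cleanup: true)
{:ok, text} = Decoder.decode(decoder, tokens)

IO.inspect(text)
# Expected: "hello world"
```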

decode(decoder, tokens)

@spec decode(t(), [String.t()]) :: {:ok, String.t()} | {:error, any()}

Decodes tokens into a string with the provided decoder.
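Since the spec returns a tagged tuple, callers will usually match on both outcomes. A minimal sketch, using a WordPiece decoder with default options and hand-written tokens (the `Mix.install/1` line assumes the `:tokenizers` package from Hex):

```elixir
# Assumption: the :tokenizers Hex package, at a version compatible with the one documented here.
Mix.install([{:tokenizers, "~> 0.5"}])

alias Tokenizers.Decoder

decoder = Decoder.word_piece()

# Note that decode/2 takes token strings, not token ids.
result = Decoder.decode(decoder, ["Hel", "##lo", "world"])

case result do
  {:ok, text} -> IO.puts(text)
  {:error, reason} -> IO.inspect(reason, label: "decode failed")
end
# Expected to print: Hello world
```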

fuse()

@spec fuse() :: t()

Creates a Fuse decoder.

metaspace(opts \\ [])

@spec metaspace(keyword()) :: t()

Creates a Metaspace decoder.

Options

:replacement - the replacement character. Defaults to ▁ (as char)
:prepend_scheme - whether to add a space to the first word if there isn't one already. This lets us treat "hello" exactly like "say hello". One of :always, :never, or :first. :first means the space is only added on the first token (relevant when special tokens or other pre-tokenizers are used). Defaults to :always
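A minimal sketch with the default options, assuming hand-written tokens in which `▁` marks a word boundary and assuming the `:tokenizers` package is installed from Hex:

```elixir
# Assumption: the :tokenizers Hex package, at a version compatible with the one documented here.
Mix.install([{:tokenizers, "~> 0.5"}])

alias Tokenizers.Decoder

# Illustrative tokens as a Metaspace pre-tokenizer might produce them.
tokens = ["▁Hello", "▁world"]

# Defaults: replacement is ▁ and prepend_scheme is :always.
decoder = Decoder.metaspace()
{:ok, text} = Decoder.decode(decoder, tokens)

# The replacement character becomes a space, and the one on the first token is dropped.
IO.inspect(text)
# Expected: "Hello world"
```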

replace(pattern, content)

@spec replace(String.t(), String.t()) :: t()

Creates a Replace decoder.

sequence(decoders)

@spec sequence(decoders :: [t()]) :: t()

Combines a list of decoders into a single sequential decoder.
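To illustrate chaining, the sketch below combines `replace/2`, `fuse/0`, and `strip/3` from this module. The particular pipeline and the tokens are illustrative choices, not a recommended configuration, and the `Mix.install/1` line assumes the `:tokenizers` package from Hex.

```elixir
# Assumption: the :tokenizers Hex package, at a version compatible with the one documented here.
Mix.install([{:tokenizers, "~> 0.5"}])

alias Tokenizers.Decoder

decoder =
  Decoder.sequence([
    # 1. Turn the word-boundary marker into a plain space in every token.
    Decoder.replace("▁", " "),
    # 2. Merge all tokens into a single token.
    Decoder.fuse(),
    # 3. Remove one leading space from that single token.
    Decoder.strip(?\s, 1, 0)
  ])

{:ok, text} = Decoder.decode(decoder, ["▁Hello", "▁world"])

IO.inspect(text)
# Expected: "Hello world"
```

The order matters here: without the fuse step, the strip step would run on every token and also remove the space between the two words.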

strip(content, left, right)

@spec strip(char(), non_neg_integer(), non_neg_integer()) :: t()

Creates a Strip decoder.

Expects a character and the maximum number of times to strip it from the left and right sides of each token.
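A small sketch of the counting behaviour, using hand-written tokens and assuming the `:tokenizers` package from Hex. The character is passed as a code point, written here with Elixir's `?_` literal.

```elixir
# Assumption: the :tokenizers Hex package, at a version compatible with the one documented here.
Mix.install([{:tokenizers, "~> 0.5"}])

alias Tokenizers.Decoder

# Strip at most one "_" from the left of each token and nothing from the right.
decoder = Decoder.strip(?_, 1, 0)

{:ok, text} = Decoder.decode(decoder, ["__Hello", "_world"])

# "__Hello" keeps its second underscore; "_world" loses its only one.
IO.inspect(text)
# Expected: "_Helloworld"
```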

word_piece(opts \\ [])

@spec word_piece(keyword()) :: t()

Creates a WordPiece decoder.

Options

:prefix - the prefix to use for subwords. Defaults to ##
:cleanup - whether to clean up tokenization artifacts. Defaults to true
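As an illustration of the `:prefix` option, the sketch below uses a non-default prefix. The `@@` prefix and the tokens are invented for the example, and the `Mix.install/1` line assumes the `:tokenizers` package from Hex.

```elixir
# Assumption: the :tokenizers Hex package, at a version compatible with the one documented here.
Mix.install([{:tokenizers, "~> 0.5"}])

alias Tokenizers.Decoder

# Hypothetical vocabulary convention: "@@" marks a continuation of the previous word.
tokens = ["un", "@@believ", "@@able", "story"]

decoder = Decoder.word_piece(prefix: "@@")
{:ok, text} = Decoder.decode(decoder, tokens)

# Prefixed tokens are glued to the previous one; other tokens start a new word.
IO.inspect(text)
# Expected: "unbelievable story"
```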