Tokenizers.Decoder (Tokenizers v0.5.0)

Decoders and decoding functions.

A decoder transforms a sequence of token ids back into a readable piece of text.

Some normalizers and pre-tokenizers use special characters or identifiers that need special logic to be reverted.

Summary

Functions

  • bpe - Creates a BPE decoder.

  • byte_fallback - Creates a ByteFallback decoder.

  • byte_level - Creates a ByteLevel decoder.

  • ctc - Creates a CTC decoder.

  • decode - Decodes tokens into a string with the provided decoder.

  • fuse - Creates a Fuse decoder.

  • metaspace - Creates a Metaspace decoder.

  • replace - Creates a Replace decoder.

  • sequence - Combines a list of decoders into a single sequential decoder.

  • strip - Creates a Strip decoder.

  • word_piece - Creates a WordPiece decoder.

Types

@type t() :: %Tokenizers.Decoder{resource: reference()}

Functions

@spec bpe(keyword()) :: t()

Creates a BPE decoder.

Options

  • :suffix - the suffix to add to the end of each word. Defaults to </w>

@spec byte_fallback() :: t()

Creates a ByteFallback decoder.

@spec byte_level() :: t()

Creates a ByteLevel decoder.

@spec ctc(keyword()) :: t()

Creates a CTC decoder.

Options

  • :pad_token - the token used for padding. Defaults to <pad>

  • :word_delimiter_token - the token used as the word delimiter. Defaults to |

  • :cleanup - whether to clean up tokenization artifacts. Defaults to true
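
As a rough illustration of the options above, the sketch below builds a CTC decoder with the documented defaults spelled out and runs it over a hand-written token list. The token list is made up for the example, and the script assumes the :tokenizers package can be fetched with Mix.install/1. CTC decoding generally collapses repeated tokens and drops padding, so the repeated letters here are only meant to exercise that path.

```elixir
# Assumption: the :tokenizers package (v0.5.x) can be installed in this script.
Mix.install([{:tokenizers, "~> 0.5.0"}])

alias Tokenizers.Decoder

# Spell out the documented defaults explicitly.
ctc_decoder =
  Decoder.ctc(pad_token: "<pad>", word_delimiter_token: "|", cleanup: true)

# Hypothetical frame-level output of a CTC model (illustrative only).
ctc_tokens = ["<pad>", "h", "e", "e", "l", "l", "<pad>", "l", "o", "|", "h", "i"]

# decode/2 returns {:ok, string} or {:error, reason}.
ctc_result = Decoder.decode(ctc_decoder, ctc_tokens)
IO.inspect(ctc_result, label: "ctc")
```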

@spec decode(t(), [String.t()]) :: {:ok, String.t()} | {:error, any()}

Decodes tokens into a string with the provided decoder.
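
A minimal sketch of a decode/2 call, assuming the :tokenizers package can be fetched with Mix.install/1. The WordPiece decoder and the token list are arbitrary choices for the example; any decoder built by the functions in this module could be passed instead.

```elixir
# Assumption: the :tokenizers package (v0.5.x) can be installed in this script.
Mix.install([{:tokenizers, "~> 0.5.0"}])

alias Tokenizers.Decoder

# Any decoder works here; WordPiece is used only as an example.
wp_decoder = Decoder.word_piece(prefix: "##", cleanup: true)

# Tokens are plain strings, as required by the @spec of decode/2.
wp_tokens = ["My", "na", "##me", "is", "Jo", "##hn"]

# Handle both shapes of the documented return value.
wp_result = Decoder.decode(wp_decoder, wp_tokens)

case wp_result do
  {:ok, text} -> IO.puts(text)
  {:error, reason} -> IO.inspect(reason, label: "decode failed")
end
```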

@spec fuse() :: t()

Creates a Fuse decoder.

@spec metaspace(keyword()) :: t()

Creates a Metaspace decoder.

Options

  • :replacement - the replacement character. Defaults to ▁ (U+2581, as char)

  • :prepend_scheme - whether to add a space to the first word if there isn't already one. This lets us treat "hello" exactly like "say hello". One of :always, :never, or :first. :first means the space is only added on the first token (relevant when special tokens or other pre-tokenizers are used). Defaults to :always
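
The options above might be passed as in the following sketch. It assumes the :tokenizers package can be fetched with Mix.install/1 and that the replacement character is U+2581 (the marker commonly used by SentencePiece-style vocabularies), given as a character literal because the option is documented "as char". The tokens are invented for the example.

```elixir
# Assumption: the :tokenizers package (v0.5.x) can be installed in this script.
Mix.install([{:tokenizers, "~> 0.5.0"}])

alias Tokenizers.Decoder

# ?▁ is the Elixir character literal for U+2581 (an assumed replacement char).
ms_decoder = Decoder.metaspace(replacement: ?▁, prepend_scheme: :always)

# Hypothetical tokens in which ▁ marks the start of a word.
ms_tokens = ["▁Hello", "▁wor", "ld"]

ms_result = Decoder.decode(ms_decoder, ms_tokens)
IO.inspect(ms_result, label: "metaspace")
```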

replace(pattern, content)

@spec replace(String.t(), String.t()) :: t()

Creates a Replace decoder.

@spec sequence(decoders :: [t()]) :: t()

Combines a list of decoders into a single sequential decoder.
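
One way to picture a sequential decoder is as a pipeline whose steps run in list order. The sketch below chains replace/2, fuse/0 and strip/3, assuming the :tokenizers package can be fetched with Mix.install/1. The underscore marker and the tokens are hypothetical and chosen only to give each step something to do.

```elixir
# Assumption: the :tokenizers package (v0.5.x) can be installed in this script.
Mix.install([{:tokenizers, "~> 0.5.0"}])

alias Tokenizers.Decoder

# Steps are listed in the order they should be applied.
seq_decoder =
  Decoder.sequence([
    # Turn the hypothetical "_" word marker into a space.
    Decoder.replace("_", " "),
    # Merge all tokens into a single one.
    Decoder.fuse(),
    # Strip up to one leading space character (?\s) and nothing on the right.
    Decoder.strip(?\s, 1, 0)
  ])

seq_tokens = ["_Hello", "_wor", "ld"]

seq_result = Decoder.decode(seq_decoder, seq_tokens)
IO.inspect(seq_result, label: "sequence")
```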

strip(content, left, right)

@spec strip(char(), non_neg_integer(), non_neg_integer()) :: t()

Creates a Strip decoder.

It expects a character and the number of times to strip that character from the left and right sides.
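
Because the @spec asks for a char() rather than a string, the first argument is a character literal such as ?_ in the sketch below. The example assumes the :tokenizers package can be fetched with Mix.install/1, and the token is invented for illustration.

```elixir
# Assumption: the :tokenizers package (v0.5.x) can be installed in this script.
Mix.install([{:tokenizers, "~> 0.5.0"}])

alias Tokenizers.Decoder

# ?_ is the character literal for "_"; strip at most 1 on the left, 2 on the right.
strip_decoder = Decoder.strip(?_, 1, 2)

# Hypothetical token padded with underscores on both sides.
strip_tokens = ["__hello___"]

strip_result = Decoder.decode(strip_decoder, strip_tokens)
IO.inspect(strip_result, label: "strip")
```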

@spec word_piece(keyword()) :: t()

Creates a WordPiece decoder.

Options

  • :prefix - the prefix to use for subwords. Defaults to ##

  • :cleanup - whether to clean up tokenization artifacts. Defaults to true