Edifice.Meta.MixtureOfTokenizers (Edifice v0.2.0)

Mixture of Tokenizers — multiple parallel embedding pathways with learned routing.

Uses N separate tokenizer embedding pathways, each with its own vocabulary size and embedding dimension, combined via learned soft routing weights. Because the routing is soft, the model blends tokenization granularities, weighting the most useful pathway at each position rather than committing to a single vocabulary.

Architecture

Input [batch, seq_len]
      |
+-- Tokenizer 1: embedding(vocab_1, embed_1) -> dense(hidden_size) --+
+-- Tokenizer 2: embedding(vocab_2, embed_2) -> dense(hidden_size) --+
+-- ...                                                             --+
+-- Tokenizer N: embedding(vocab_N, embed_N) -> dense(hidden_size) --+
      |
Router: shared_embed -> dense(N) -> softmax -> weights [batch, seq_len, N]
      |
Weighted sum -> [batch, seq_len, hidden_size]
      |
Transformer blocks -> final norm -> last timestep
      |
[batch, hidden_size]
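
The fusion step can be sketched directly in Axon. The following is an illustration of the diagram above, not Edifice's actual implementation; the module name MoTSketch, the pathway/4 helper, and the hard-coded four-pathway configuration are assumptions made for this example.

defmodule MoTSketch do
  # One pathway: embedding lookup, then projection to the shared hidden size.
  defp pathway(input, vocab_size, embed_dim, hidden_size) do
    input
    |> Axon.embedding(vocab_size, embed_dim)
    |> Axon.dense(hidden_size)
  end

  def fused_embedding(hidden_size \\ 256) do
    input = Axon.input("tokens", shape: {nil, nil})

    pathways =
      Enum.zip([256, 512, 1024, 2048], [32, 64, 128, 256])
      |> Enum.map(fn {vocab, dim} -> pathway(input, vocab, dim, hidden_size) end)

    # Router: shared embedding -> one logit per pathway -> softmax,
    # yielding per-position mixing weights of shape [batch, seq_len, N].
    router =
      input
      |> Axon.embedding(256, 32)
      |> Axon.dense(length(pathways))
      |> Axon.softmax()

    # Weighted sum: stack pathways to [batch, seq_len, N, hidden], broadcast
    # the weights to [batch, seq_len, N, 1], multiply, and reduce over N.
    fuse = fn weights, p1, p2, p3, p4, _opts ->
      stacked = Nx.stack([p1, p2, p3, p4], axis: 2)
      weights = Nx.new_axis(weights, -1)

      stacked
      |> Nx.multiply(weights)
      |> Nx.sum(axes: [2])
    end

    Axon.layer(fuse, [router | pathways], name: "tokenizer_fusion")
  end
end

The transformer blocks, final norm, and last-timestep selection from the diagram would then follow this fused [batch, seq_len, hidden_size] tensor.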

Usage

model = MixtureOfTokenizers.build(
  hidden_size: 256,
  num_tokenizers: 4,
  tokenizer_vocab_sizes: [256, 512, 1024, 2048],
  tokenizer_embed_dims: [32, 64, 128, 256]
)
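
The built model can be smoke-tested with Axon directly. A minimal sketch, assuming a recent Axon/Nx and that input token ids are valid indices into every pathway's vocabulary (the smallest vocabulary above is 256):

{init_fn, predict_fn} = Axon.build(model)

# Token ids must stay below the smallest vocab size (256 here).
tokens = Nx.remainder(Nx.iota({2, 60}), 256)

params = init_fn.(tokens, %{})
output = predict_fn.(params, tokens)

Nx.shape(output)
#=> {2, 256}  # [batch, hidden_size]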

References

  • "Mixture-of-Tokenizers" (Pham et al., 2024) — multi-granularity tokenization

Summary

Types

build_opt()
Options for build/1.

Functions

build(opts \\ [])
Build a Mixture of Tokenizers model.

output_size(opts \\ [])
Get the output size of the model.

Get recommended defaults.

Types

build_opt()

@type build_opt() ::
  {:hidden_size, pos_integer()}
  | {:num_tokenizers, pos_integer()}
  | {:tokenizer_vocab_sizes, [pos_integer()]}
  | {:tokenizer_embed_dims, [pos_integer()]}
  | {:num_layers, pos_integer()}
  | {:num_heads, pos_integer()}
  | {:num_kv_heads, pos_integer()}
  | {:dropout, float()}
  | {:window_size, pos_integer()}

Options for build/1.

Functions

build(opts \\ [])

@spec build([build_opt()]) :: Axon.t()

Build a Mixture of Tokenizers model.

Options

  • :hidden_size - Output hidden dimension (default: 256)
  • :num_tokenizers - Number of parallel tokenizer pathways (default: 4)
  • :tokenizer_vocab_sizes - List of vocab sizes per tokenizer (default: [256, 512, 1024, 2048])
  • :tokenizer_embed_dims - List of embedding dims per tokenizer (default: [32, 64, 128, 256])
  • :num_layers - Number of transformer layers after fusion (default: 4)
  • :num_heads - Number of attention heads (default: 4)
  • :num_kv_heads - Number of KV heads for GQA (default: 2)
  • :dropout - Dropout rate (default: 0.1)
  • :window_size - Expected sequence length (default: 60)

Returns

An Axon model outputting [batch, hidden_size].
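
Since :tokenizer_vocab_sizes and :tokenizer_embed_dims supply one entry per pathway, their lengths should line up with :num_tokenizers. A hedged example of a smaller two-pathway configuration (the exact validation behavior is not documented here, so matching lengths are assumed to be required):

model =
  MixtureOfTokenizers.build(
    hidden_size: 128,
    num_tokenizers: 2,
    tokenizer_vocab_sizes: [256, 1024],
    tokenizer_embed_dims: [32, 96],
    num_layers: 2,
    dropout: 0.0
  )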

output_size(opts \\ [])

@spec output_size(keyword()) :: pos_integer()

Get the output size of the model.
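
Given that build/1 returns a model outputting [batch, hidden_size], output_size/1 presumably echoes the configured :hidden_size (defaulting to 256). A sketch of the expected behavior, not verified against the implementation:

MixtureOfTokenizers.output_size(hidden_size: 512)
#=> 512

MixtureOfTokenizers.output_size()
#=> 256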