Mixture of Tokenizers — multiple parallel embedding pathways with learned routing.
Uses N separate tokenizer embedding pathways, each with its own vocabulary size and embedding dimension, and combines them with learned soft routing weights. Because the router produces a softmax distribution over the pathways at every position, the model can weight tokenization granularities differently per position instead of committing to a single one.
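The soft routing step described above can be sketched with plain Nx. This is a minimal illustration and not the module's actual code: `RoutingSketch`, `softmax/1`, and `mix/2` are hypothetical names, the pathway outputs are assumed to be already projected to `hidden_size`, and the `Mix.install` line is only there so the snippet runs as a standalone script.

```elixir
# Standalone script setup (assumption: Nx is fetched from Hex).
Mix.install([{:nx, "~> 0.7"}])

defmodule RoutingSketch do
  # Hypothetical helper, not part of MixtureOfTokenizers.

  # logits: {batch, seq_len, n}
  # Returns routing weights that sum to 1 over the last axis.
  def softmax(logits) do
    # Subtract the per-position max for numerical stability.
    max = Nx.reduce_max(logits, axes: [-1], keep_axes: true)
    exp = Nx.exp(Nx.subtract(logits, max))
    Nx.divide(exp, Nx.sum(exp, axes: [-1], keep_axes: true))
  end

  # pathways: {n, batch, seq_len, hidden}, one slice per tokenizer pathway
  # weights:  {batch, seq_len, n}
  # Returns the weighted sum with shape {batch, seq_len, hidden}.
  def mix(pathways, weights) do
    w =
      weights
      |> Nx.transpose(axes: [2, 0, 1])
      |> Nx.new_axis(-1)

    pathways
    |> Nx.multiply(w)
    |> Nx.sum(axes: [0])
  end
end

# Two pathways, batch 1, seq_len 2, hidden 3.
pathways = Nx.stack([Nx.broadcast(1.0, {1, 2, 3}), Nx.broadcast(3.0, {1, 2, 3})])
# Position 0 weights both pathways equally; position 1 strongly prefers pathway 1.
logits = Nx.tensor([[[0.0, 0.0], [10.0, -10.0]]])

mixed = RoutingSketch.mix(pathways, RoutingSketch.softmax(logits))
IO.inspect(Nx.shape(mixed))
```

With equal logits the output at a position is the plain average of the pathway outputs. As one logit dominates, the output approaches that single pathway, which is how soft routing approximates a hard choice of granularity.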
Architecture
    Input [batch, seq_len]
          |
          +-- Tokenizer 1: embedding(vocab_1, embed_1) -> dense(hidden_size) --+
          +-- Tokenizer 2: embedding(vocab_2, embed_2) -> dense(hidden_size) --+
          +-- ...                                                              +
          +-- Tokenizer N: embedding(vocab_N, embed_N) -> dense(hidden_size) --+
          |
    Router: shared_embed -> dense(N) -> softmax -> weights [batch, seq_len, N]
          |
    Weighted sum -> [batch, seq_len, hidden_size]
          |
    Transformer blocks -> final norm -> last timestep
          |
    [batch, hidden_size]
Usage
    model = MixtureOfTokenizers.build(
      hidden_size: 256,
      num_tokenizers: 4,
      tokenizer_vocab_sizes: [256, 512, 1024, 2048],
      tokenizer_embed_dims: [32, 64, 128, 256]
    )
References
- "Mixture-of-Tokenizers" (Pham et al., 2024) — multi-granularity tokenization
Types
    @type build_opt() ::
            {:hidden_size, pos_integer()}
            | {:num_tokenizers, pos_integer()}
            | {:tokenizer_vocab_sizes, [pos_integer()]}
            | {:tokenizer_embed_dims, [pos_integer()]}
            | {:num_layers, pos_integer()}
            | {:num_heads, pos_integer()}
            | {:num_kv_heads, pos_integer()}
            | {:dropout, float()}
            | {:window_size, pos_integer()}
Options for build/1.
Functions
Build a Mixture of Tokenizers model.
Options
- :hidden_size - Output hidden dimension (default: 256)
- :num_tokenizers - Number of parallel tokenizer pathways (default: 4)
- :tokenizer_vocab_sizes - List of vocab sizes per tokenizer (default: [256, 512, 1024, 2048])
- :tokenizer_embed_dims - List of embedding dims per tokenizer (default: [32, 64, 128, 256])
- :num_layers - Number of transformer layers after fusion (default: 4)
- :num_heads - Number of attention heads (default: 4)
- :num_kv_heads - Number of KV heads for GQA (default: 2)
- :dropout - Dropout rate (default: 0.1)
- :window_size - Expected sequence length (default: 60)
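Several of these options depend on one another, so it can help to check them before building. The sketch below is an assumption about sensible constraints, not a description of what build/1 itself enforces: `MoTOptionsSketch` and `validate/1` are hypothetical, the defaults are copied from the list above, and the divisibility checks reflect how multi-head and grouped-query attention are usually set up.

```elixir
defmodule MoTOptionsSketch do
  # Hypothetical helper; defaults copied from the documented option list.
  @defaults [
    hidden_size: 256,
    num_tokenizers: 4,
    tokenizer_vocab_sizes: [256, 512, 1024, 2048],
    tokenizer_embed_dims: [32, 64, 128, 256],
    num_layers: 4,
    num_heads: 4,
    num_kv_heads: 2,
    dropout: 0.1,
    window_size: 60
  ]

  # Merges user options over the defaults and checks cross-option consistency.
  # Returns {:ok, merged_opts} or {:error, reason}.
  def validate(opts \\ []) do
    opts = Keyword.merge(@defaults, opts)
    n = opts[:num_tokenizers]

    cond do
      # One vocab size and one embedding dim per pathway.
      length(opts[:tokenizer_vocab_sizes]) != n ->
        {:error, ":tokenizer_vocab_sizes must have #{n} entries"}

      length(opts[:tokenizer_embed_dims]) != n ->
        {:error, ":tokenizer_embed_dims must have #{n} entries"}

      # Each attention head gets an equal slice of the hidden dimension.
      rem(opts[:hidden_size], opts[:num_heads]) != 0 ->
        {:error, ":hidden_size must be divisible by :num_heads"}

      # In GQA, every KV head serves an equal group of query heads.
      rem(opts[:num_heads], opts[:num_kv_heads]) != 0 ->
        {:error, ":num_heads must be divisible by :num_kv_heads"}

      true ->
        {:ok, opts}
    end
  end
end

IO.inspect(MoTOptionsSketch.validate(num_tokenizers: 3))
```

The last line reports an error because the default lists still contain four entries while `:num_tokenizers` asks for three.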
Returns
An Axon model outputting [batch, hidden_size].
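The output has no sequence axis because the architecture keeps only the last timestep after the final norm. A minimal Nx sketch of that reduction follows; `LastTimestepSketch` and `last_timestep/1` are hypothetical names, and the library may implement this step differently.

```elixir
# Standalone script setup (assumption: Nx is fetched from Hex).
Mix.install([{:nx, "~> 0.7"}])

defmodule LastTimestepSketch do
  # Hypothetical helper, not part of the documented API.
  # x: {batch, seq_len, hidden} -> {batch, hidden}
  def last_timestep(x) do
    seq_len = Nx.axis_size(x, 1)

    x
    # Keep only the final position along the sequence axis.
    |> Nx.slice_along_axis(seq_len - 1, 1, axis: 1)
    # Drop the now size-1 sequence axis.
    |> Nx.squeeze(axes: [1])
  end
end

# batch 2, seq_len 3, hidden 4
x = Nx.iota({2, 3, 4})
IO.inspect(Nx.shape(LastTimestepSketch.last_timestep(x)))
```

For the `{2, 3, 4}` input the result has shape `{2, 4}`, matching the documented `[batch, hidden_size]` output.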
@spec output_size(keyword()) :: pos_integer()
Get the output size of the model.
@spec recommended_defaults() :: keyword()
Get recommended defaults.