OpenVLA: Open-Source Vision-Language-Action Model.
OpenVLA is a vision-language-action model for robot manipulation that combines a DINOv2 vision encoder with a Llama-style language model backbone to predict discretized robot actions from images and text instructions.
Key Innovation: Action Tokenization
Instead of predicting continuous actions directly, OpenVLA discretizes each action dimension into bins (default 256), treating robot control as a sequence-to-sequence problem. The model autoregressively generates action tokens conditioned on visual and language tokens.
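In practice, decoding reduces to an argmax over each action dimension's bin logits followed by detokenization. A minimal Nx sketch of that step, assuming a hypothetical `action_logits` tensor shaped [batch, action_dim, num_bins] as this model outputs:

# Greedy decode: most likely bin per action dimension, then back to
# continuous values via the binning described below.
predicted_tokens = Nx.argmax(action_logits, axis: -1)  # [batch, action_dim]
actions = OpenVLA.detokenize_actions(predicted_tokens, 256)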
Architecture
Image [batch, C, H, W]               Text "pick up the red block"
          |                                      |
          v                                      v
 +================+                     +================+
 |   DINOv2 ViT   |                     | LLM Tokenizer  |
 +================+                     +================+
          |                                      |
          v                                      v
   Visual Tokens                          Text Token IDs
 [batch, num_patches, vision_dim]         [batch, text_len]
          |                                      |
          v                                      v
 [Vision-LM Projection]                   [LLM Embedding]
 [batch, num_patches, hidden_dim]         [batch, text_len, hidden_dim]
          |                                      |
          +------------------+-------------------+
                             |
                             v
               [Concatenate: visual | text]
        [batch, num_patches + text_len, hidden_dim]
                             |
                             v
                   +====================+
                   |  Llama-style LLM   |
                   |   (Decoder-Only)   |
                   |  with Causal Mask  |
                   +====================+
                             |
                             v
                    Action Token Logits
                [batch, action_dim, num_bins]
                             |
                             v
                    Argmax -> Detokenize
                             |
                             v
                     Continuous Actions
                     [batch, action_dim]
Action Tokenization
For an action dimension with range [a_min, a_max]:
- Tokenize: bin_index = floor((action - a_min) / (a_max - a_min) * (num_bins - 1))
- Detokenize: action = a_min + (bin_index / (num_bins - 1)) * (a_max - a_min)
Default: 256 bins per dimension, covering the normalized range [-1, 1].
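As a concrete check of these formulas, here is a minimal Nx sketch (plain tensor math, not the library internals; values assume the default [-1, 1] range):

a_min = -1.0
a_max = 1.0
num_bins = 256

action = Nx.tensor([0.5, -0.25, 1.0])

# Tokenize: floor((action - a_min) / (a_max - a_min) * (num_bins - 1))
bin_index =
  action
  |> Nx.subtract(a_min)
  |> Nx.divide(a_max - a_min)
  |> Nx.multiply(num_bins - 1)
  |> Nx.floor()
  |> Nx.as_type(:s64)
# => [191, 95, 255]

# Detokenize: a_min + bin_index / (num_bins - 1) * (a_max - a_min)
recovered =
  bin_index
  |> Nx.divide(num_bins - 1)
  |> Nx.multiply(a_max - a_min)
  |> Nx.add(a_min)
# => approx. [0.498, -0.255, 1.0], within one bin width of the input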
Usage
model = OpenVLA.build(
image_size: 224,
vision_encoder: :dino_v2,
action_dim: 7,
num_action_bins: 256,
hidden_dim: 2048,
num_layers: 24,
num_heads: 16
)
# Forward pass: image + text -> action logits
action_logits = Axon.predict(model, params, %{"image" => img, "text_tokens" => tokens})
# Tokenize ground truth actions for training
action_tokens = OpenVLA.tokenize_actions(actions, 256)
# Compute loss (cross-entropy on action tokens only)
loss = OpenVLA.vla_loss(action_logits, action_tokens)
# Detokenize predictions for inference
actions = OpenVLA.detokenize_actions(predicted_tokens, 256)
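For end-to-end training, one possible setup uses Axon.Loop (a sketch: `train_data` is a hypothetical stream of `{inputs, action_tokens}` batches, and the wrapper reorders arguments because Axon loss callbacks receive `(y_true, y_pred)`):

# Hypothetical training loop; the EXLA compiler is assumed but optional.
loss_fn = fn target_tokens, logits -> OpenVLA.vla_loss(logits, target_tokens) end

trained_params =
  model
  |> Axon.Loop.trainer(loss_fn, :adam)
  |> Axon.Loop.run(train_data, %{}, epochs: 1, compiler: EXLA)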
References
- Paper: "OpenVLA: An Open-Source Vision-Language-Action Model" (Kim et al., 2024) - https://arxiv.org/abs/2406.09246
- Project: https://openvla.github.io/
Summary
Functions
build/1 - Build an OpenVLA model.
detokenize_actions/3 - Detokenize discrete bin indices back to continuous actions.
encode_image/2 - Encode an image through the vision encoder (DINOv2-style ViT).
output_size/1 - Get the output size of an OpenVLA model.
param_count/1 - Calculate approximate parameter count for an OpenVLA model.
recommended_defaults/0 - Recommended default configuration for OpenVLA.
small_config/0 - Get small model configuration (for testing/prototyping).
tokenize_actions/3 - Tokenize continuous actions into discrete bin indices.
vla_loss/2 - Compute VLA loss (cross-entropy on action tokens).
Types
@type build_opt() ::
        {:image_size, pos_integer()}
        | {:patch_size, pos_integer()}
        | {:in_channels, pos_integer()}
        | {:vision_dim, pos_integer()}
        | {:hidden_dim, pos_integer()}
        | {:num_layers, pos_integer()}
        | {:num_heads, pos_integer()}
        | {:num_kv_heads, pos_integer()}
        | {:action_dim, pos_integer()}
        | {:num_action_bins, pos_integer()}
        | {:max_text_len, pos_integer()}
        | {:dropout, float()}
        | {:rope, boolean()}
Options for build/1.
Functions
build/1
Build an OpenVLA model.
Options
- :image_size - Input image size, square (default: 224)
- :patch_size - ViT patch size (default: 14)
- :in_channels - Number of input channels (default: 3)
- :vision_dim - Vision encoder output dimension (default: 384)
- :hidden_dim - LLM hidden dimension (default: 2048)
- :num_layers - Number of LLM layers (default: 24)
- :num_heads - Number of attention heads (default: 16)
- :num_kv_heads - Number of KV heads for GQA (default: 4)
- :action_dim - Robot action dimension (default: 7)
- :num_action_bins - Number of bins for action discretization (default: 256)
- :max_text_len - Maximum text sequence length (default: 64)
- :dropout - Dropout rate (default: 0.1)
- :rope - Apply RoPE to attention (default: false)
Returns
An Axon model with inputs "image" and "text_tokens", outputting
[batch, action_dim, num_action_bins] logits.
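A minimal sketch of initializing and running the returned model with Axon's standard build flow (shapes follow the defaults above, with channels-first images as in the architecture diagram; `img` and `tokens` are hypothetical batched tensors):

model = OpenVLA.build(OpenVLA.recommended_defaults())
{init_fn, predict_fn} = Axon.build(model)

templates = %{
  "image" => Nx.template({1, 3, 224, 224}, :f32),
  "text_tokens" => Nx.template({1, 64}, :s64)
}

params = init_fn.(templates, %{})
logits = predict_fn.(params, %{"image" => img, "text_tokens" => tokens})
# logits: [batch, action_dim, num_action_bins]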
detokenize_actions/3
@spec detokenize_actions(Nx.Tensor.t(), pos_integer(), keyword()) :: Nx.Tensor.t()
Detokenize discrete bin indices back to continuous actions.
Converts bin indices to the center of each bin.
Parameters
- action_tokens - Integer tensor of bin indices [batch, action_dim]
- num_bins - Number of bins per dimension (default: 256)
Options
- :action_min - Minimum action value (default: -1.0)
- :action_max - Maximum action value (default: 1.0)
Returns
Float tensor of continuous actions with same shape as input.
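For example, a tokenize/detokenize round trip with the default [-1, 1] range (a sketch; exact recovered values depend on the binning described above):

actions = Nx.tensor([[0.5, -0.25, 0.0, 1.0, -1.0, 0.75, 0.1]])
tokens = OpenVLA.tokenize_actions(actions, 256)
recovered = OpenVLA.detokenize_actions(tokens, 256)
# recovered matches the input to within one bin width (2 / 255 ≈ 0.008)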
encode_image/2
Encode an image through the vision encoder (DINOv2-style ViT).
Returns visual tokens without the DINO projection head.
Parameters
- image - Axon node for image input
- opts - Build options
Returns
Axon node with shape [batch, num_patches, vision_dim].
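A hypothetical example of reusing the vision tower on its own, assuming channels-first input as in the diagram above:

image = Axon.input("image", shape: {nil, 3, 224, 224})
visual_tokens = OpenVLA.encode_image(image, OpenVLA.recommended_defaults())
# Axon node producing [batch, num_patches, vision_dim]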
output_size/1
@spec output_size(keyword()) :: pos_integer()
Get the output size of an OpenVLA model.
param_count/1
@spec param_count(keyword()) :: pos_integer()
Calculate approximate parameter count for an OpenVLA model.
recommended_defaults/0
@spec recommended_defaults() :: keyword()
Recommended default configuration for OpenVLA.
small_config/0
@spec small_config() :: keyword()
Get small model configuration (for testing/prototyping).
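A sketch of how the configuration and sizing helpers compose (returned values depend on the implementation):

opts = OpenVLA.small_config()
OpenVLA.param_count(opts)  # approximate parameter count for this config
OpenVLA.output_size(opts)  # size of the action-logit output
model = OpenVLA.build(opts)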
tokenize_actions/3
@spec tokenize_actions(Nx.Tensor.t(), pos_integer(), keyword()) :: Nx.Tensor.t()
Tokenize continuous actions into discrete bin indices.
Discretizes each action dimension into num_bins uniform bins.
Parameters
- actions - Continuous actions tensor [batch, action_dim] or [action_dim]
- num_bins - Number of bins per dimension (default: 256)
Options
- :action_min - Minimum action value (default: -1.0)
- :action_max - Maximum action value (default: 1.0)
Returns
Integer tensor of bin indices with same shape as input.
vla_loss/2
@spec vla_loss(Nx.Tensor.t(), Nx.Tensor.t()) :: Nx.Tensor.t()
Compute VLA loss (cross-entropy on action tokens).
Parameters
- logits - Predicted action logits [batch, action_dim, num_bins]
- target_tokens - Ground truth action tokens [batch, action_dim]
Returns
Scalar loss tensor.
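For instance, with random inputs (a sketch using Nx.Random from current Nx; for near-uniform random logits the loss is roughly ln(256) ≈ 5.5):

key = Nx.Random.key(42)
{logits, key} = Nx.Random.uniform(key, shape: {2, 7, 256})
{targets, _key} = Nx.Random.randint(key, 0, 255, shape: {2, 7})
loss = OpenVLA.vla_loss(logits, targets)  # scalar tensor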