OpenVLA: Open-Source Vision-Language-Action Model.
OpenVLA is a vision-language-action model for robot manipulation that combines a DINOv2 vision encoder with a Llama-style language model backbone to predict discretized robot actions from images and text instructions.
Key Innovation: Action Tokenization
Instead of predicting continuous actions directly, OpenVLA discretizes each action dimension into bins (default 256), treating robot control as a sequence-to-sequence problem. The model autoregressively generates action tokens conditioned on visual and language tokens.
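In practice, decoding reduces to an argmax over each action dimension's bin logits followed by detokenization. A minimal Nx sketch of that step, assuming a hypothetical `action_logits` tensor shaped [batch, action_dim, num_bins] as this model outputs:

# Greedy decode: most likely bin per action dimension, then back to
# continuous values via the binning described below.
predicted_tokens = Nx.argmax(action_logits, axis: -1)  # [batch, action_dim]
actions = OpenVLA.detokenize_actions(predicted_tokens, 256)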
Architecture
Image [batch, C, H, W]               Text "pick up the red block"
          |                                      |
          v                                      v
 +================+                     +================+
 |   DINOv2 ViT   |                     | LLM Tokenizer  |
 +================+                     +================+
          |                                      |
          v                                      v
   Visual Tokens                          Text Token IDs
 [batch, num_patches, vision_dim]         [batch, text_len]
          |                                      |
          v                                      v
 [Vision-LM Projection]                   [LLM Embedding]
 [batch, num_patches, hidden_dim]         [batch, text_len, hidden_dim]
          |                                      |
          +------------------+-------------------+
                             |
                             v
               [Concatenate: visual | text]
        [batch, num_patches + text_len, hidden_dim]
                             |
                             v
                   +====================+
                   |  Llama-style LLM   |
                   |   (Decoder-Only)   |
                   |  with Causal Mask  |
                   +====================+
                             |
                             v
                    Action Token Logits
                [batch, action_dim, num_bins]
                             |
                             v
                    Argmax -> Detokenize
                             |
                             v
                     Continuous Actions
                     [batch, action_dim]
Action Tokenization
For an action dimension with range [a_min, a_max]:
- Tokenize: bin_index = floor((action - a_min) / (a_max - a_min) * (num_bins - 1))
- Detokenize: action = a_min + (bin_index / (num_bins - 1)) * (a_max - a_min)
Default: 256 bins per dimension, covering the normalized range [-1, 1].
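As a concrete check of these formulas, here is a minimal Nx sketch (plain tensor math, not the library internals; values assume the default [-1, 1] range):

a_min = -1.0
a_max = 1.0
num_bins = 256

action = Nx.tensor([0.5, -0.25, 1.0])

# Tokenize: floor((action - a_min) / (a_max - a_min) * (num_bins - 1))
bin_index =
  action
  |> Nx.subtract(a_min)
  |> Nx.divide(a_max - a_min)
  |> Nx.multiply(num_bins - 1)
  |> Nx.floor()
  |> Nx.as_type(:s64)
# => [191, 95, 255]

# Detokenize: a_min + bin_index / (num_bins - 1) * (a_max - a_min)
recovered =
  bin_index
  |> Nx.divide(num_bins - 1)
  |> Nx.multiply(a_max - a_min)
  |> Nx.add(a_min)
# => approx. [0.498, -0.255, 1.0], within one bin width of the input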
Usage
model = OpenVLA.build(
image_size: 224,
vision_encoder: :dino_v2,
action_dim: 7,
num_action_bins: 256,
hidden_dim: 2048,
num_layers: 24,
num_heads: 16
)
# Forward pass: image + text -> action logits
action_logits = Axon.predict(model, params, %{"image" => img, "text_tokens" => tokens})
# Tokenize ground truth actions for training
action_tokens = OpenVLA.tokenize_actions(actions, 256)
# Compute loss (cross-entropy on action tokens only)
loss = OpenVLA.vla_loss(action_logits, action_tokens)
# Detokenize predictions for inference
actions = OpenVLA.detokenize_actions(predicted_tokens, 256)
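For end-to-end training, one possible setup uses Axon.Loop (a sketch: `train_data` is a hypothetical stream of `{inputs, action_tokens}` batches, and the wrapper reorders arguments because Axon loss callbacks receive `(y_true, y_pred)`):

# Hypothetical training loop; the EXLA compiler is assumed but optional.
loss_fn = fn target_tokens, logits -> OpenVLA.vla_loss(logits, target_tokens) end

trained_params =
  model
  |> Axon.Loop.trainer(loss_fn, :adam)
  |> Axon.Loop.run(train_data, %{}, epochs: 1, compiler: EXLA)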
References
- Paper: "OpenVLA: An Open-Source Vision-Language-Action Model" (Kim et al., 2024) - https://arxiv.org/abs/2406.09246
- Project: https://openvla.github.io/
Summary
Functions
build/1 - Build an OpenVLA model.
detokenize_actions/3 - Detokenize discrete bin indices back to continuous actions.
encode_image/2 - Encode an image through the vision encoder (DINOv2-style ViT).
output_size/1 - Get the output size of an OpenVLA model.
param_count/1 - Calculate approximate parameter count for an OpenVLA model.
recommended_defaults/0 - Recommended default configuration for OpenVLA.
small_config/0 - Get small model configuration (for testing/prototyping).
tokenize_actions/3 - Tokenize continuous actions into discrete bin indices.
vla_loss/2 - Compute VLA loss (cross-entropy on action tokens).
Types
@type build_opt() ::
        {:image_size, pos_integer()}
        | {:patch_size, pos_integer()}
        | {:in_channels, pos_integer()}
        | {:vision_dim, pos_integer()}
        | {:hidden_dim, pos_integer()}
        | {:num_layers, pos_integer()}
        | {:num_heads, pos_integer()}
        | {:num_kv_heads, pos_integer()}
        | {:action_dim, pos_integer()}
        | {:num_action_bins, pos_integer()}
        | {:max_text_len, pos_integer()}
        | {:dropout, float()}
        | {:rope, boolean()}
Options for build/1.
Functions
build/1
Build an OpenVLA model.
Options
- :image_size - Input image size, square (default: 224)
- :patch_size - ViT patch size (default: 14)
- :in_channels - Number of input channels (default: 3)
- :vision_dim - Vision encoder output dimension (default: 384)
- :hidden_dim - LLM hidden dimension (default: 2048)
- :num_layers - Number of LLM layers (default: 24)
- :num_heads - Number of attention heads (default: 16)
- :num_kv_heads - Number of KV heads for GQA (default: 4)
- :action_dim - Robot action dimension (default: 7)
- :num_action_bins - Number of bins for action discretization (default: 256)
- :max_text_len - Maximum text sequence length (default: 64)
- :dropout - Dropout rate (default: 0.1)
- :rope - Apply RoPE to attention (default: false)
Returns
An Axon model with inputs "image" and "text_tokens", outputting
[batch, action_dim, num_action_bins] logits.
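A minimal sketch of initializing and running the returned model with Axon's standard build flow (shapes follow the defaults above, with channels-first images as in the architecture diagram; `img` and `tokens` are hypothetical batched tensors):

model = OpenVLA.build(OpenVLA.recommended_defaults())
{init_fn, predict_fn} = Axon.build(model)

templates = %{
  "image" => Nx.template({1, 3, 224, 224}, :f32),
  "text_tokens" => Nx.template({1, 64}, :s64)
}

params = init_fn.(templates, %{})
logits = predict_fn.(params, %{"image" => img, "text_tokens" => tokens})
# logits: [batch, action_dim, num_action_bins]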
detokenize_actions/3
@spec detokenize_actions(Nx.Tensor.t(), pos_integer(), keyword()) :: Nx.Tensor.t()
Detokenize discrete bin indices back to continuous actions.
Converts bin indices to the center of each bin.
Parameters
- action_tokens - Integer tensor of bin indices [batch, action_dim]
- num_bins - Number of bins per dimension (default: 256)
Options
- :action_min - Minimum action value (default: -1.0)
- :action_max - Maximum action value (default: 1.0)
Returns
Float tensor of continuous actions with same shape as input.
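For example, a tokenize/detokenize round trip with the default [-1, 1] range (a sketch; exact recovered values depend on the binning described above):

actions = Nx.tensor([[0.5, -0.25, 0.0, 1.0, -1.0, 0.75, 0.1]])
tokens = OpenVLA.tokenize_actions(actions, 256)
recovered = OpenVLA.detokenize_actions(tokens, 256)
# recovered matches the input to within one bin width (2 / 255 ≈ 0.008)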
encode_image/2
Encode an image through the vision encoder (DINOv2-style ViT).
Returns visual tokens without the DINO projection head.
Parameters
- image - Axon node for image input
- opts - Build options
Returns
Axon node with shape [batch, num_patches, vision_dim].
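A hypothetical example of reusing the vision tower on its own, assuming channels-first input as in the diagram above:

image = Axon.input("image", shape: {nil, 3, 224, 224})
visual_tokens = OpenVLA.encode_image(image, OpenVLA.recommended_defaults())
# Axon node producing [batch, num_patches, vision_dim]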
output_size/1
@spec output_size(keyword()) :: pos_integer()
Get the output size of an OpenVLA model.
param_count/1
@spec param_count(keyword()) :: pos_integer()
Calculate approximate parameter count for an OpenVLA model.
recommended_defaults/0
@spec recommended_defaults() :: keyword()
Recommended default configuration for OpenVLA.
small_config/0
@spec small_config() :: keyword()
Get small model configuration (for testing/prototyping).
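A sketch of how the configuration and sizing helpers compose (returned values depend on the implementation):

opts = OpenVLA.small_config()
OpenVLA.param_count(opts)  # approximate parameter count for this config
OpenVLA.output_size(opts)  # size of the action-logit output
model = OpenVLA.build(opts)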
tokenize_actions/3
@spec tokenize_actions(Nx.Tensor.t(), pos_integer(), keyword()) :: Nx.Tensor.t()
Tokenize continuous actions into discrete bin indices.
Discretizes each action dimension into num_bins uniform bins.
Parameters
- actions - Continuous actions tensor [batch, action_dim] or [action_dim]
- num_bins - Number of bins per dimension (default: 256)
Options
- :action_min - Minimum action value (default: -1.0)
- :action_max - Maximum action value (default: 1.0)
Returns
Integer tensor of bin indices with same shape as input.
vla_loss/2
@spec vla_loss(Nx.Tensor.t(), Nx.Tensor.t()) :: Nx.Tensor.t()
Compute VLA loss (cross-entropy on action tokens).
Parameters
- logits - Predicted action logits [batch, action_dim, num_bins]
- target_tokens - Ground truth action tokens [batch, action_dim]
Returns
Scalar loss tensor.
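For instance, with random inputs (a sketch using Nx.Random from current Nx; for near-uniform random logits the loss is roughly ln(256) ≈ 5.5):

key = Nx.Random.key(42)
{logits, key} = Nx.Random.uniform(key, shape: {2, 7, 256})
{targets, _key} = Nx.Random.randint(key, 0, 255, shape: {2, 7})
loss = OpenVLA.vla_loss(logits, targets)  # scalar tensor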